PCI Express and DMA

Started by SongDragon May 8, 2006
I am looking for some assistance writing a driver and FPGA code to handle 
DMA on a PCI Express system. The FPGA is a Xilinx V2P with a Xilinx x4 PCIe 
LogiCORE (v3.0).

I've scoured through the entire PCI Express Base Specification v2.0 (the 
Solari/Intel book) and DMA isn't mentioned once, as far as I can tell. I 
suppose it is at a higher level than the base spec covers. The Xilinx 
manuals don't mention it, either. I've also googled everywhere (websites, 
groups, etc.) for mention of PCI Express and DMA, to no avail.

Where should I go to find out how PCI Express handles DMA? What should the 
TLP messages look like? Are there any reference designs / sample code 
available?

I look forward to hearing from the community about this issue.

Thank you,

--Alex Gross 


In article 
<VbOdnQfqqqqJwsLZnZ2dneKdnZydnZ2d@comcast.com>, 
songdrgn@g.m.a.i.l.n0.spam.com says...

[ ... ]
 
> I've scoured through the entire PCI Express Base Specification v2.0 (the 
> Solari/Intel book) and DMA isn't mentioned once, as far as I can tell. I 
> suppose it is at a higher level than the base spec covers. The Xilinx 
> manuals don't mention it, either. I've also googled everywhere (websites, 
> groups, etc.) for mention of PCI Express and DMA, to no avail.
PCI (express or otherwise) doesn't really support DMA as such. Looking for 
bus mastering is much more likely to get you useful information.

-- 
Later,
Jerry.
The universe is a figment of its own imagination.
"SongDragon" <songdrgn@g.m.a.i.l.n0.spam.com> wrote in message 
news:VbOdnQfqqqqJwsLZnZ2dneKdnZydnZ2d@comcast.com...
> I am looking for some assistance writing a driver and FPGA code to handle 
> DMA on a PCI Express system. The FPGA is a Xilinx V2P with a Xilinx x4 
> PCIe LogiCORE (v3.0).
>
> I've scoured through the entire PCI Express Base Specification v2.0 (the 
> Solari/Intel book) and DMA isn't mentioned once, as far as I can tell. I 
> suppose it is at a higher level than the base spec covers. The Xilinx 
> manuals don't mention it, either. I've also googled everywhere (websites, 
> groups, etc.) for mention of PCI Express and DMA, to no avail.
>
> Where should I go to find out how PCI Express handles DMA? What should 
> the TLP messages look like? Are there any reference designs / sample code 
> available?
>
> I look forward to hearing from the community about this issue.
>
> Thank you,
>
> --Alex Gross
The DMA isn't done by PCI Express itself - it's done by the surrounding 
layers. PCI, PCI-X, and PCIe all have the ability to be a master in a burst 
transaction.

For your FPGA to DMA to another system, the FPGA needs a request to master 
a transaction issued to the core. Once granted, the transaction will 
specify the location for the data transfer, which has to be coordinated in 
your system, not in the PCIe core. The transaction can provide a complete 
payload or may be interrupted (at least in PCI/X land) to allow other 
higher-priority transactions to occur.

Look at mastering transactions and post again with further questions.
SongDragon wrote:

> I am looking for some assistance writing a driver and FPGA code to 
> handle DMA on a PCI Express system. The FPGA is a Xilinx V2P with a 
> Xilinx x4 PCIe LogiCORE (v3.0).
Assuming the LogiCORE is capable of bus mastering, you need to instantiate 
a 'DMA controller' in your FPGA; either your own design or borrowed from 
another source.

A 'DMA controller' can simply be a set of registers (sometimes referred to 
as 'descriptors') mapped into the host address space that allow the 
software to set up a DMA transfer - source address, destination address, 
transfer size, control/status etc - hit a 'GO' bit, and generate an 
interrupt when it's done. If you want to get more fancy, add multiple 
channels, scatter-gather descriptors, request queuing, etc.

From the back side of the PCIe core, all the DMA controller does is request 
the bus and issue a standard (burst, in PCI-land) read/write to/from the 
source/destination addresses in the registers. PCIe itself has no concept 
of 'DMA' - all it sees is another PCIe transfer.

Exactly how you establish the transfer in the core depends on the backend 
interface of the LogiCORE. You shouldn't have to worry about the format of 
the TLP at all if there's a decent backend interface.
> Are there any reference designs / sample code available?
A DMA controller IP core for PCI would still illustrate the concepts and 
give some insight into what you're in for.

At the risk of muddying the waters further, there's a Wishbone DMA core on 
OpenCores which can ultimately be used for PCI DMA transfers when connected 
to a PCI core (the OpenCores PCI core is a Wishbone bridge, so it bolts 
straight on). Might even be worth just looking at the doco for it.

As for the driver, that will depend on what class of device you're 
implementing, especially if you're talking about Windows. Your best bet 
there is to find an open-source/example driver for a similar device. If 
you're doing Windows and need a 'grass-roots' high-performance driver, 
prepare yourself for a frustrating and challenging time.

Regards,

-- 
Mark McDougall, Engineer
Virtual Logic Pty Ltd, <http://www.vl.com.au>
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255  Fax: +612-9599-3266
Thanks for the helpful responses from everyone.

The basic idea seems to be as follows:

1) device driver (let's say for linux 2.6.x) requests some kernel-level 
physical memory
2) device driver performs MEMWRITE32 (length = 1) to a register 
("destination descriptor") on the PCIe device, setting destination address 
in the memory
3) device driver performs MEMWRITE32 (length = 1) to a register ("length 
descriptor") on the PCIe device, setting length "N" (We'll say this also 
signals "GO")
4) PCIe device sends MEMWRITE32s (each length = up to 128 bytes at a time) 
to _______ (what is the destination?) until length N is reached
5) PCIe device sends interrupt (for now, let's say INTA ... it could be MSI, 
though)
6) device driver services interrupt and writes a zero to a register 
("serviced descriptor"), telling the PCIe device the interrupt has been 
fielded.

I have a number of questions regarding this. First and foremost, is this 
view of the transaction correct? Is this actually "bus mastering"? It seems 
like for PCIe, since there is no "bus", there are no additional 
requirements to handle other devices "requesting" the bus. So I shouldn't 
have to perform any bus arbitration (listen in to see if any of the other 
INT pins are being triggered, etc.). Is this assumption correct?

In PCI Express, you have to specify a bunch of things in the TLP header, 
including bus #, device #, function #, and tag. I'm not sure what these 
values should be. If the CPU were requesting a MEMREAD32, the values for 
these fields in the MEMREAD32_COMPLETION response would be set to the same 
values as were included in the MEMREAD32. However, since the PCIe device is 
actually sending out a MEMWRITE32 command, the values for these fields are 
not clear to me.


Thanks,

--Alex 


SongDragon wrote:

> 1) device driver (let's say for linux 2.6.x) requests some
(snip snip)
> writes a zero to a register ("serviced descriptor"), telling the PCIe 
> device the interrupt has been fielded.
> I have a number of questions regarding this. First and foremost, is 
> this view of the transaction correct? Is this actually "bus mastering"? 
> It seems like for PCIe, since there is no "bus", there are no additional 
> requirements to handle other devices "requesting" the bus. So I 
> shouldn't have to perform any bus arbitration (listen in to see if any 
> of the other INT pins are being triggered, etc.). Is this assumption 
> correct?
Your description of events is pretty much correct. The exact registers and 
sequencing will of course depend on your implementation of a DMA 
controller. You'll need a source register too, unless the data is being 
supplied by a FIFO or I/O "pipe" on the device.

"Bus mastering" is a PCI term and refers to the ability to initiate a PCI 
transfer - which also implies the capability to request the bus. In PCIe 
nomenclature, an entity that can initiate a transfer is referred to as a 
"requester", and you're right, there's no arbitration involved as such. 
But this is the equivalent of a PCI bus master, I suppose. The target of 
the request is called the "completer".

This is where my knowledge of PCIe becomes thinner, as I'm currently in 
the process of ramping up for a PCIe project myself. But I have worked on 
several PCI projects, so I think my foundations are valid. For example, 
using a (bus-mastering) PCI core you wouldn't have to 'worry about' 
requesting the bus etc. - initiating a request via the back-end of the 
core would trigger that functionality in the core transparently for you. 
As far as your device is concerned, you have "exclusive" use of the bus - 
you may just have to wait a bit to get to use it (and you may get 
interrupted occasionally). Arbitration etc. is not your problem.
> In PCI Express, you have to specify a bunch of things in the TLP 
> header, including bus #, device #, function #, and tag. I'm not sure 
> what these values should be. If the CPU were requesting a MEMREAD32, 
> the values for these fields in the MEMREAD32_COMPLETION response would 
> be set to the same values as were included in the MEMREAD32. However, 
> since the PCIe device is actually sending out a MEMWRITE32 command, the 
> values for these fields are not clear to me.
This is where I'll have to defer to others...

Regards,

-- 
Mark McDougall, Engineer
Virtual Logic Pty Ltd, <http://www.vl.com.au>
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255  Fax: +612-9599-3266
Mark McDougall wrote:

> SongDragon wrote:
>
>> 1) device driver (let's say for linux 2.6.x) requests some
BTW if you're writing Linux device drivers as opposed to Windows drivers, 
you're in for a *much* easier ride! :)

Regards,

-- 
Mark McDougall, Engineer
Virtual Logic Pty Ltd, <http://www.vl.com.au>
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255  Fax: +612-9599-3266
Easier ride? How much easier?

Just yesterday I wrote a test application that allocates a system DMA 
buffer and sends its physical address to the PCI target, which then starts 
a master transaction.

The PCI backend logic needed about 20 lines of Verilog; for the WinXP test 
application I wrote about 15 lines of Delphi code.

You say on Linux it would be easier?

Well, if you have a Linux box in your bedroom, then maybe :)

Antti
PS: actually Linux device drivers are fun, I agree, but quick and dirty 
direct hardware programming on WinXP is simple as well.

Antti wrote:

> easier ride? how much easier?
As I said:

>> If you're doing windows and need a 'grass-roots' high performance 
>> driver, prepare yourself for a frustrating and challenging time.

> PS actually linux device drivers are fun, I agree, but quick dirty 
> direct hardware programming on WinXP is simple as well.

There are several options these days to make life a lot easier on Windows, 
for example the Jungo tools, TVicPCI, etc. But to some extent it depends on 
what type of driver you're writing, what performance you need, and what 
versions of Windows you need to support.

A big part of the time/effort is simply ramping up on Windows device 
drivers - working out what *type* of driver you need to write (is it WDM? 
native kernel mode? VxD? upper/lower filter? HID?) - sometimes you even 
need two drivers! - and how it fits into the whole mess.

Years and years ago I spent *months* writing a SCSI miniport driver for 
95/NT4/2K, which included support calls to M$. Once I'd finished, it took 
me 3 days to get a basic port running on Linux, and I'd *never* written a 
Linux device driver before.

Regards,

-- 
Mark McDougall, Engineer
Virtual Logic Pty Ltd, <http://www.vl.com.au>
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255  Fax: +612-9599-3266
Alex,

I was wondering if you made any more progress with the PCI Express DMA 
problem. I have a similar problem, but it concerns bursting data from the 
host to the endpoint. My Windows driver sets up a buffer of data to be 
sent to the endpoint and initiates a block transfer. The chipset, however, 
breaks this block into multiple single-DW transfers, effectively killing 
performance. I believe that allowing the endpoint to become the bus master 
and initiate block transfers by reading from the allocated buffer on the 
host will lead to better bus utilization. Do you have any ideas about 
this, or any updates on your progress with DMA?

Thanks --Kevin