
high bandwidth Ethernet communication

Started by eliben September 5, 2007
On Wed, 05 Sep 2007 18:06:16 -0700, Janaka <janakas@optiscan.com>
wrote:

> We are using the MPC8349E at a 400 MHz core clock and got only 480 Mbit/s
> sustained UDP data rate. These processors are marketed as communications
> processors but only have low-level HW support (at the Ethernet layer).
> All the upper-level IP and UDP protocols are handled in software (when
> running Linux), so it takes up CPU time. The same setup on two desktop
> PCs running Linux yields 840 Mbit/s sustained UDP rate.
If the OP requires only dedicated point-to-point connectivity, why bother with the IP wrapper? Just send raw Ethernet frames with MAC addressing. Apparently that MPC has some modern version of the QUICC co-processor (as found on the MC68360), in which it is quite easy to set up one BD (buffer descriptor) for the (possibly fixed) header and another for the actual data. The co-processor assembles the frame from the fragments, sends it autonomously, appends the CRC and then searches for the next ready frame to send, without any further main-processor intervention.

The hard thing is to get the transmit data into the transmit buffers fast enough, but for direct port-to-port copying there should not be much need to move the actual data around in memory.

Paul
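For the PC end of such a link, raw frames are also easy to send from Linux with packet(7). A minimal sketch, assuming a Linux host; the interface name, destination MAC and EtherType below are placeholders, and the NIC appends the FCS:

#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_ALEN, ETH_DATA_LEN, ETH_FRAME_LEN */
#include <linux/if_packet.h>  /* struct sockaddr_ll */
#include <net/if.h>           /* if_nametoindex */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Placeholder destination MAC and a "local experimental" EtherType. */
static const unsigned char dst_mac[ETH_ALEN] = { 0x02, 0, 0, 0, 0, 0x01 };
#define RAW_ETHERTYPE 0x88B5

int send_raw_frame(const char *ifname, const unsigned char src_mac[ETH_ALEN],
                   const unsigned char *payload, size_t len)
{
    if (len > ETH_DATA_LEN)
        return -1;

    int fd = socket(AF_PACKET, SOCK_RAW, htons(RAW_ETHERTYPE));
    if (fd < 0)
        return -1;

    /* Frame = dst MAC, src MAC, EtherType, payload.  The NIC adds the CRC. */
    unsigned char frame[ETH_FRAME_LEN];
    memcpy(frame, dst_mac, ETH_ALEN);
    memcpy(frame + ETH_ALEN, src_mac, ETH_ALEN);
    frame[12] = RAW_ETHERTYPE >> 8;
    frame[13] = RAW_ETHERTYPE & 0xff;
    memcpy(frame + 14, payload, len);

    struct sockaddr_ll addr;
    memset(&addr, 0, sizeof(addr));
    addr.sll_family  = AF_PACKET;
    addr.sll_ifindex = if_nametoindex(ifname);
    addr.sll_halen   = ETH_ALEN;
    memcpy(addr.sll_addr, dst_mac, ETH_ALEN);

    int ok = sendto(fd, frame, 14 + len, 0,
                    (struct sockaddr *)&addr, sizeof(addr)) >= 0;
    close(fd);
    return ok ? 0 : -1;
}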
I suggest that you go back and read John McCaskill's response.
It would probably help to discuss things with a network wizard
(especially one who knows something about hardware).
You want a low-level protocol geek, not a web designer.

I think the real question is what happens when a packet
gets lost?  If you are using TCP, you have to buffer
all the data until it gets ACKed.  If you are using UDP,
you drop some data.

UDP in send-only mode doesn't really require a stack.


> So where is the actual UDP communication implemented ? In Linux ? What
> processor is it running on ? Is it an external CPU or the built-in PPC
> of Virtex 4 ?
If I were doing this (or something like what I think you are doing), I would try to do all the UDP in the FPGA. The header is just a bunch of constants. You probably want a sequence number in your payload. Then you have to compute the CRC. The whole thing is well specified, and you can be sure it will run fast enough as long as the network doesn't get congested. No ACKs, just fire and forget.

An alternative approach is to get the data into a PC somehow, and do the UDP/whatever work from that PC. As somebody else already suggested, one "easy" way to get the data into a PC would be to use Ethernet on a point-to-point link. You don't even need a CRC. This is easy for the hardware. The software guys might not like it: they have to steal the Ethernet port from the software stack and write a driver. You might look at tcpdump and see how it handles packets with CRC errors.

There are various PCI boards with an FPGA on them. If you can get your data onto one of those cards, then you can DMA it into memory. That still needs software, but it is a slightly different type of software.

--
These are my opinions, not necessarily my employer's.  I hate spam.
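As a rough C model of the "header is just a bunch of constants" idea: addresses and ports below are placeholders, the UDP checksum is legitimately sent as zero over IPv4 (which is what keeps the fire-and-forget FPGA version so simple), and in hardware these same bytes would come from registers while the MAC appends the Ethernet CRC. The Ethernet header goes in front of this buffer.

#include <stdint.h>
#include <string.h>

static uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* buf must have room for 20 + 8 + 4 header/sequence bytes plus the payload. */
size_t build_udp_packet(uint8_t *buf, uint32_t seq,
                        const uint8_t *payload, uint16_t paylen)
{
    uint16_t udp_len = 8 + 4 + paylen;           /* UDP hdr + seq + data   */
    uint16_t ip_len  = 20 + udp_len;             /* IP hdr + UDP           */

    static const uint8_t ip_const[20] = {
        0x45, 0x00, 0, 0,        /* ver/IHL, TOS, total length (patched)   */
        0x00, 0x00, 0x40, 0x00,  /* ID, flags: don't fragment              */
        0x40, 0x11, 0, 0,        /* TTL 64, protocol 17 (UDP), checksum    */
        192, 168, 1, 10,         /* source IP  (placeholder)               */
        192, 168, 1, 20          /* dest   IP  (placeholder)               */
    };
    memcpy(buf, ip_const, 20);
    buf[2] = ip_len >> 8;  buf[3] = ip_len & 0xff;
    uint16_t csum = ip_checksum(buf, 20);
    buf[10] = csum >> 8;   buf[11] = csum & 0xff;

    buf[20] = 0x30; buf[21] = 0x39;              /* src port 12345         */
    buf[22] = 0x30; buf[23] = 0x39;              /* dst port 12345         */
    buf[24] = udp_len >> 8; buf[25] = udp_len & 0xff;
    buf[26] = 0; buf[27] = 0;                    /* UDP checksum 0 = none  */

    buf[28] = (uint8_t)(seq >> 24);              /* sequence number in the */
    buf[29] = (uint8_t)(seq >> 16);              /* first payload bytes    */
    buf[30] = (uint8_t)(seq >> 8);
    buf[31] = (uint8_t)seq;
    memcpy(buf + 32, payload, paylen);
    return 32u + paylen;
}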
On Sep 6, 9:52 am, Paul Keinanen <keina...@sci.fi> wrote:
> On Wed, 05 Sep 2007 18:06:16 -0700, Janaka <jana...@optiscan.com> wrote:
>
> > [...]
>
> If the OP required only something dedicated point to point
> connectivity, why bother with the IP wrapper, just send raw Ethernet
> frames with MAC addressing ?
I wondered about that, actually. But working on the MAC level is very inflexible. For example:

1) What if the client computer gets replaced by an equivalent computer? Each NIC has a unique MAC address, so I'll have to reconfigure my sender, or set up some manual MAC discovery protocol.

2) If the client is a PC of some sort, working at the MAC packet level isn't too simple, as the networking APIs don't provide that level. A separate driver for the NIC would have to be used, or whatever.

3) If I want to advance to a more complicated network, such as one with a few clients, working at the IP level is much more convenient, as I can set up a router with all the niceties it brings - multicasts, groups, etc.

Eli
On Thu, 06 Sep 2007 11:19:09 -0000, eliben <eliben@gmail.com> wrote:

>On Sep 6, 9:52 am, Paul Keinanen <keina...@sci.fi> wrote:
>> If the OP required only something dedicated point to point
>> connectivity, why bother with the IP wrapper, just send raw Ethernet
>> frames with MAC addressing ?
>
> I wondered about that, actually. But working on the MAC level is very
> inflexible. For example:
>
> 1) What if the client computer gets replaced by an equivalent
> computer. Each NIC has a unique MAC address, and so I'll have to
> reconfigure my sender, or set up some manual MAC discovery protocol.
The "manual MAC discovery protocol" could be ARP, which is simple to implement (manually creating the request IP header) and you get the MAC address of the other partner. After that, you do not have to bother about any IP addresses in the message headers in the actual high speed data transfers. Only if you send the data to some hot standby redundant system, in which the MAC address can change at any time, but again, you just would have to repeat the ARP protocol query.
> 2) If the client is a PC of some sort, working on the MAC packet level
> isn't too simple, as the networking APIs don't provide that level. A
> separate driver to the NIC should be used, or whatever.
I haven't written any raw Ethernet protocols in two decades, but in those days setting the receiver into promiscuous mode was all that was needed. I still assume that current Ethernet cards support promiscuous mode, since there are a lot of Ethernet and TCP/UDP/IP analysing programs working with standard Ethernet adapters. Are these analysing programs using some dedicated driver stacks?

Given the cost of the system the OP described, installing an extra network card in the receiving PC would not be a cost issue. Thus, one NIC could handle the fast traffic in promiscuous mode, while the other NIC(s) could handle ordinary network traffic.
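On Linux this can be done per-socket through packet(7); a small sketch, assuming fd is an AF_PACKET socket (though, as noted further down the thread, promiscuous mode is not actually needed if the frames are unicast to the NIC's own MAC address):

#include <linux/if_packet.h>  /* struct packet_mreq, PACKET_MR_PROMISC */
#include <net/if.h>           /* if_nametoindex */
#include <string.h>
#include <sys/socket.h>

/* Put one interface into promiscuous mode for the lifetime of the socket. */
int enable_promisc(int fd, const char *ifname)
{
    struct packet_mreq mreq;
    memset(&mreq, 0, sizeof(mreq));
    mreq.mr_ifindex = if_nametoindex(ifname);
    mreq.mr_type    = PACKET_MR_PROMISC;
    return setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP,
                      &mreq, sizeof(mreq));
}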
> 3) If I want to advance to a more complicated network, such as one
> with a few clients, working on the IP level is much more convenient as
> I can set up a router with all the niceties it brings - multicasts,
> groups, etc.
MAC broadcasts work well with hubs. This kind of MAC broadcast is used in some producer/consumer-model, Ethernet-based industrial networks these days.

Paul
On Thu, 06 Sep 2007 05:20:14 -0000, eliben <eliben@gmail.com> wrote:

>> My first choice would be some other, more embeddable, processor running
>> off to the side. A PowerPC from Freescale, or an ARM processor from just
>> about anybody, comes to mind. I suspect that even a modest such processor
>> would get some pretty high speeds if that's all it was doing.
>>
>> You may have to bite the bullet and write your own stack that's shared
>> between a processor and the FPGA. I know practically nothing about
>> TCP/IP, but I'm willing to bet that once you've signed up to writing
>> or modifying your own stack there are some obvious things to put into the
>> FPGA to speed things up.
>
> Thanks, this is the design I'm now leaning towards. However, I have a
> concern regarding the high-speed communication between the FPGA and
> the outside processor. How is it done - using some high-speed external
> DDR memory?
If the FPGA were a Virtex-II Pro, or a V4FX or so, the PowerPC wouldn't be external. Mind you, after designing logic, anything running on the PowerPC seems painfully slow...

- Brian
On 2007-09-06, eliben <eliben@gmail.com> wrote:

> 2) If the client is a PC of some sort, working on the MAC packet level
> isn't too simple,
It's easy:

  $ man packet
> as the networking APIs don't provide that level.
Sure they do. See above. Of course it sucks trying to do it under Windows, but it sucks trying to do _anything_ under Windows. ;)
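A minimal receive-side sketch of what packet(7) gives you: bind an AF_PACKET socket to one interface and read whole frames, bypassing the IP/UDP stack entirely. The EtherType and interface name are placeholders, matching the send-side sketch earlier in the thread:

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int open_raw_rx(const char *ifname, uint16_t ethertype)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ethertype));
    if (fd < 0)
        return -1;

    struct sockaddr_ll addr;
    memset(&addr, 0, sizeof(addr));
    addr.sll_family   = AF_PACKET;
    addr.sll_protocol = htons(ethertype);
    addr.sll_ifindex  = if_nametoindex(ifname);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;   /* recv() now returns one full Ethernet frame per call */
}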
> A separate driver to the NIC should be used, or whatever.
>
> 3) If I want to advance to a more complicated network, such as one
> with a few clients, working on the IP level is much more convenient as
> I can set up a router with all the niceties it brings - multicasts,
> groups, etc.
Yup. One of the products I work on started out with MAC-level networking. It's fast and has very low overhead, but there are always going to be customers who want IP networking. So now the product will do either MAC networking or TCP networking (or both, actually).

Beware of relying on the Ethernet CRC. I've run across two different uController/MAC combinations where it wasn't reliable.

--
Grant Edwards
On 2007-09-06, Paul Keinanen <keinanen@sci.fi> wrote:

>> 2) If the client is a PC of some sort, working on the MAC packet level
>> isn't too simple, as the networking APIs don't provide that level. A
>> separate driver to the NIC should be used, or whatever.
>
> I haven't written any raw Ethernet protocols in two decades, but in
> those days setting the receiver into promiscuous mode was all that was
> needed.
There's no need for promiscuous mode. None of the MAC packet level products I've worked on used promiscuous mode at all.
> Given the cost of the system the OP described, installing an extra
> network card in the receiving PC would not be a cost issue. Thus, one
> NIC could handle the fast traffic in promiscuous mode, while the other
> NIC(s) could handle ordinary network traffic.
I don't see what promiscuous mode has to do with it. The MAC-level protocols I worked with were all still unicast.

--
Grant Edwards
On Sep 6, 12:22 am, eliben <eli...@gmail.com> wrote:
> > If you have the choice between UDP and TCP, UDP is much simpler and
> > fits an FPGA well. The big issue in choosing between the two is if you
> > require the guaranteed delivery of TCP, or can tolerate the potential
> > packet loss of UDP.
> >
> > As an example, we make a card that acquires real time data in a custom
> > protocol that is wrapped in UDP. We use a Xilinx Virtex-4 FX60, and a
> > protocol offload engine that uses the Xilinx PicoBlaze soft processor
> > to deal with the protocol stack. The PicoBlaze is an 8-bit soft
> > processor. It looks at each incoming packet and reads the header to
> > see if it is one of the real time streams we are trying to offload.
> > If it is, it sends the header to one circular buffer in memory and the
> > data to another circular buffer. If it is not, it sends it to a
> > kernel buffer and we let the Linux network stack deal with it.
> >
> > With this setup, we can consume data at over 90 MB/sec per Gigabit
> > Ethernet port. The data part of the packet is 1024 bytes, and each
> > GigE port has its own PicoBlaze dedicated to it.
>
> So where is the actual UDP communication implemented ? In Linux ? What
> processor is it running on ? Is it an external CPU or the built-in PPC
> of Virtex 4 ?
We are using one of the embedded PowerPCs to run Linux, and one PicoBlaze soft processor per EMAC in the design.

As each packet exits the EMAC, its header is examined by software that is running on the PicoBlaze. The PicoBlaze is running a very simple stack written in assembly language. As it looks at each layer of the header, it makes a decision to do one of several things. At the Ethernet level, it is deciding if it should just throw the packet away, or pass it on to the next layer. At the IP and UDP layers, it is deciding if the packet belongs to the protocol that we are offloading, and is a stream that we have requested. If the packet does belong to the protocol that we are offloading, the PicoBlaze sets up a PLB DMA engine to send it down one data path. If the packet does not belong to the protocol we are offloading, then the PicoBlaze sends it down another data path and it is given to the Linux kernel to deal with.

So for the protocol that we are offloading, the entire Ethernet, IP, UDP, and custom protocol stack is implemented in PicoBlaze assembly code. For everything else, the stack is in Linux. Since the data we are offloading is multicast data, this lets us have the PicoBlaze deal with the simple but high-speed UDP packets, and the PowerPC running Linux deal with the IGMP messaging required to join, leave and maintain membership in a multicast group.

The PicoBlaze is just running a single-threaded loop of code that polls for input or output data, and then deals with it. Its program is loaded from the PowerPC, taking advantage of the fact that the BRAM holding its program is dual ported. The PowerPC also tells the PicoBlaze what streams of data it is supposed to be acquiring, and what buffers in DDR2 memory to write the data to. The PicoBlaze will generate an interrupt once it has written a certain number of packets to the buffer, so the PowerPC just sees big chunks of data showing up in the buffers and does not have to deal with each packet. The PowerPC only needs to spend 1 or 2 percent of its compute to deal with acquiring each stream of data.
> > I did notice that you want to send GigE instead of receive it like we
> > are doing, but this method should work for sending a custom protocol
> > wrapped in UDP with some minor changes.
> >
> > How is the GigE that you are sending the data over connected? Is it
> > point to point, a dedicated network, or something else?
>
> We can assume for the sake of discussion that it is point to point,
> since the network is small and we're likely to use a fast switch to
> ensure exclusive links between systems.
>
> Eli
With this setup, you should be able to have an error rate that is incredibly low. As long as the possibility of a lost packet is not catastrophic, UDP would be a very good match. The protocol we offload has sequence numbers in it, so we know if we have lost a packet. The data is being used for realtime signal processing, so losing a packet just looks like a burst of noise.

Regards,
John McCaskill
www.fastertechnology.com
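A compilable C sketch of the per-packet decision John describes above; the real implementation is PicoBlaze assembly driving PLB DMA engines, and the field names and stream-table layout here are invented for illustration:

#include <stdbool.h>
#include <stdint.h>

enum verdict { TO_LINUX, TO_OFFLOAD_RINGS };

struct pkt_hdr {              /* fields the PicoBlaze pulls from the headers  */
    uint16_t ethertype;       /* 0x0800 = IPv4                                */
    uint8_t  ip_proto;        /* 17 = UDP                                     */
    uint32_t dst_ip;          /* multicast group of the stream                */
    uint16_t dst_port;
};

struct stream {               /* one entry per stream the PowerPC requested   */
    uint32_t group_ip;
    uint16_t port;
    bool     active;
};

enum verdict classify(const struct pkt_hdr *h,
                      const struct stream *streams, int nstreams)
{
    if (h->ethertype != 0x0800 || h->ip_proto != 17)
        return TO_LINUX;                 /* not UDP/IPv4: kernel buffer, let
                                            Linux (IGMP etc.) deal with it    */
    for (int i = 0; i < nstreams; i++)
        if (streams[i].active &&
            streams[i].group_ip == h->dst_ip &&
            streams[i].port     == h->dst_port)
            return TO_OFFLOAD_RINGS;     /* header -> header ring,
                                            payload -> data ring, via DMA     */
    return TO_LINUX;                     /* UDP, but not a requested stream   */
}

In the real design the TO_OFFLOAD_RINGS case becomes two DMA transfers, one to the header ring and one to the data ring in DDR2, and the PowerPC is interrupted only after a batch of packets has been written, as described above.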
On Sep 6, 4:45 pm, John McCaskill <jhmccask...@gmail.com> wrote:
> [...]
>
> The PicoBlaze is just running a single-threaded loop of code that
> polls for input or output data, and then deals with it. Its program
> is loaded from the PowerPC, taking advantage of the fact that the BRAM
> holding its program is dual ported.
At what frequency does the PicoBlaze run? It must be pretty fast to deal with packets at this bandwidth. Or does the fact that it is only examining the frame headers save you from needing that speed?

Thanks
Eli
On Sep 6, 1:25 pm, eliben <eli...@gmail.com> wrote:
> [...]
>
> At what frequency does the PicoBlaze run? It must be pretty fast to
> deal with packets at this bandwidth. Or does the fact that it is only
> examining the frame headers save you from needing that speed?
We run the EMAC 8 bits wide at 125 MHz, and the PicoBlaze at 62.5 MHz using a divided-by-two version of the EMAC clock. The PicoBlaze takes two cycles per instruction, and the packets we are offloading are a bit over 1 KB, so we have about 512 instructions to deal with an offloaded packet and other overhead. Dealing with a non-offloaded packet takes the shortest path through the code, to keep up the number of packets per second we can handle. The network the data is on is tightly controlled, so there is very little on it that is not the protocol we are offloading - mostly just IGMP packets for dealing with the multicast groups, and they are at a very low rate.

You are correct in that we do not look at the entire packet with the PicoBlaze, just the header. Once it has determined that it wants to offload a packet, it then has a little bit more work to do to calculate addresses and load them into the DMA engine. To make sure that we do not drop packets, we just need to make sure that the longest path through the code takes less time than roughly how long it takes to receive a packet. We have a FIFO between the EMAC and the DMA engine, so we can smooth things out a bit.

Regards,
John McCaskill
www.fastertechnology.com