Xilinx QUIZ 2008

Started by Antti December 30, 2008

system setup

* Xilinx Virtex FPGA
- DDR2 memory
- SFP sockets on MGTs
- Gigabit TEMAC with SG DMA

1G fiber module in the SFP,
fiber to D-Link media adapter
cable to D-Link GB switch
cable to PC GB ethernet port
PC is running only 1 custom application

FPGA is sending UDP packets to PC
PC sends a small number of small UDP
packets that are answered by the FPGA

Problem:
UDP commands are no longer processed
or answered by the FPGA after, say,
15 minutes from the start of communication.
The time depends on the PC and
the application running; it may
sometimes be several hours before the
communication stops.


Yesterday I thought I had solved the problem:
the RX buffers were not aligned properly
so I assumed that could have caused the
problem for the SG DMA.

But after fixing this, the problem persisted.
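
A minimal sketch of how such an RX buffer alignment check might look in C. The 32-byte requirement, the buffer count, and the names `RX_ALIGN`, `rx_buf` and `is_aligned` are assumptions for illustration only; the alignment the SG DMA actually needs has to come from the core's documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values (hypothetical; check the DMA core's data sheet). */
#define RX_ALIGN   32u
#define RX_BUF_LEN 2048u
#define RX_BUF_CNT 8u

/* GCC-style attribute: place the RX buffers on an RX_ALIGN boundary.
 * RX_BUF_LEN is a multiple of RX_ALIGN, so every element is aligned too. */
static uint8_t rx_buf[RX_BUF_CNT][RX_BUF_LEN] __attribute__((aligned(RX_ALIGN)));

/* Returns 1 if p lies on an 'align'-byte boundary (align: power of two). */
static int is_aligned(const void *p, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Returns 1 if every RX buffer is aligned, 0 otherwise. */
static int rx_bufs_aligned(void)
{
    size_t i;

    for (i = 0; i < RX_BUF_CNT; i++) {
        if (!is_aligned(rx_buf[i], RX_ALIGN))
            return 0;
    }
    return 1;
}
```

Calling `rx_bufs_aligned()` once at start-up, before the descriptors are handed to the DMA, would rule this cause in or out early.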

The PPC is not running wild, nor is
there a spurious reset coming; the main loop
is still working, and the interrupts as well.

But the DMA registers after the failure
are written with either 0, random or wrong
values.

I have been troubleshooting this system for some
time already and had many great ideas about what
could have been the cause of the problem, but none
of them made any difference.

Hmm... adding single-character UART debug symbols
made it NOT FAIL (or maybe I did not
wait long enough), so I removed those debug
printouts to make the problem visible
again.

I have a mini UART debug routine built in,
so I can type commands to read memory
and the DCR bus whenever I want while the
system is running.
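
For readers who have not built such a monitor, here is a rough sketch of the command parsing side in C. The command syntax ("m <hex>" for a memory read, "d <hex>" for a DCR read) and all names are hypothetical and not the routine used in this design; the actual memory and DCR accesses are left out because they are platform specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical command kinds for a tiny UART monitor. */
enum dbg_kind { DBG_NONE, DBG_MEM, DBG_DCR };

struct dbg_cmd {
    enum dbg_kind kind;
    uint32_t      addr;   /* memory address or DCR register number */
};

/* Parse "m <hex>" (memory read) or "d <hex>" (DCR read).
 * Returns 1 and fills *out on success; returns 0 and leaves *out
 * untouched on a malformed line. */
static int dbg_parse(const char *line, struct dbg_cmd *out)
{
    enum dbg_kind kind;
    unsigned long v;
    char *end;

    assert(line != NULL && out != NULL);

    if (line[0] == 'm')
        kind = DBG_MEM;
    else if (line[0] == 'd')
        kind = DBG_DCR;
    else
        return 0;

    if (line[1] != ' ')
        return 0;

    v = strtoul(line + 2, &end, 16);   /* accepts "1000" and "0x1000" */
    if (end == line + 2)               /* no hex digits found */
        return 0;

    out->kind = kind;
    out->addr = (uint32_t)v;
    return 1;
}
```

The caller would collect UART characters into a line buffer, pass it to `dbg_parse()`, and then perform the read and print the value.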

So I see the DMA regs being corrupted, but
that doesn't give much of a hint as to how or why.

It looks like the TX BD address value has
been written to the RX LEN register; the other
regs are either 0 or completely random.

Anybody dare to propose a solution?

Yesterday I believed the answer to be: ALIGN
but I was wrong.

Antti
Antti wrote:
> Xilinx QUIZ 2008
>
> system setup
>
> * Xilinx Virtex FPGA
> - DDR2 memory
> - SFP sockets on MGTs
> - Gigabit TEMAC with SG DMA

Ok, what I can remember from running into Virtex-4FX issues (information from about one year ago, so maybe that has changed meanwhile) with ethernet:

SG-DMA used to be buggy.

The HardTemac version used to be silicon-revision dependent (and was not selected properly automatically).

Maybe that helps a little.

Regards,

Lorenz
On Dec 30, 2:11 pm, Lorenz Kolb <lorenz.k...@uni-ulm.de> wrote:
> Antti wrote:
> > Xilinx QUIZ 2008
> >
> > system setup
> >
> > * Xilinx Virtex FPGA
> > - DDR2 memory
> > - SFP sockets on MGTs
> > - Gigabit TEMAC with SG DMA
>
> Ok, what I can remember from running in Virtex-4FX issues (information
> from about one year ago, so maybe that has changed meanwhile) with ethernet:
>
> SG-DMA used to be buggy.
>
> HardTemac-Version used to be Silicon-Revision dependent (and was not
> selected properly automatically)
>
> Maybe that helps a little.
>
> Regards,
>
> Lorenz

Thank you, well, it's not very encouraging :(

The system uses

LL_TEMAC_SGMII_V1_00a (user modified!)
and MPMC2
#define GUI_VERSION 1.9
#define PCORE_VERSION _v2_10_a
#define pcorename mpmc2_ddr2_pnncc_200mhz_x16_mt47h16m16_3

I know this is rather old and so on, but currently I have no option to upgrade the complete system to MPMC2 4.x.

Anything you recall about the bugginess? What did the buggy behavior cause?

I think the hw revision is not an issue, as the MGT seems to work OK under all circumstances; just the DMA engine DCR registers get wrong values, and then it all stops.

Antti
Antti wrote:
> On Dec 30, 2:11 pm, Lorenz Kolb <lorenz.k...@uni-ulm.de> wrote:
>> Antti wrote:
>>> Xilinx QUIZ 2008
>>> system setup
>>> * Xilinx Virtex FPGA
>>> - DDR2 memory
>>> - SFP sockets on MGTs
>>> - Gigabit TEMAC with SG DMA
>> Ok, what I can remember from running in Virtex-4FX issues (information
>> from about one year ago, so maybe that has changed meanwhile) with ethernet:
>>
>> SG-DMA used to be buggy.
>>
>> HardTemac-Version used to be Silicon-Revision dependent (and was not
>> selected properly automatically)
>>
>> Maybe that helps a little.
>>
>> Regards,
>>
>> Lorenz
>
> Thank you, well it not very encouraging :(
> the system uses
>
> LL_TEMAC_SGMII_V1_00a (user modified!)
> and MPMC2
> #define GUI_VERSION 1.9
> #define PCORE_VERSION _v2_10_a
> #define pcorename mpmc2_ddr2_pnncc_200mhz_x16_mt47h16m16_3

Ah, sorry, I'm out. Until now I have only used the old-fashioned way (PLB_Temac + HardTemac), as I didn't need the extra performance of an MPMC, sorry.

But please also check your version of the hard_temac IP core: e.g. for the silicon revision of our ML403s we needed

BEGIN hard_temac
PARAMETER INSTANCE = hard_temac_0
PARAMETER HW_VER = 3.00.a

though EDK encouraged us to use 3.00.b instead...

Good luck, anyway,

Lorenz
Hi Antti,

Try disabling cache if it is enabled.
Try increasing the stack.

Also, take a look at the old GSRD reference design using MPMC and LL_TEMAC.
It used to work quite reliably, but it was a long time ago that I last
tried it.


/Mikhail 


Antti <Antti.Lukats@googlemail.com> wrote:
> PC sends very little amount of small UDP
> packets that are responded by FPGA

I'm wondering if greatly increasing the volume of packets going from the PC to the FPGA would make the problem reproduce faster, etc. Do you have the flexibility to change the PC side to increase or even flood it with status checks or some noop command?

G.
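
A rough sketch of what such a PC-side flood test could look like in C, assuming a POSIX sockets environment (on Windows the Winsock equivalents would be needed). The function name `flood_udp` and the address, port and payload in the usage note are placeholders; the real command format of the application is not known from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Send 'count' copies of one UDP payload to ip:port as fast as possible.
 * Returns the number of datagrams accepted by the network stack,
 * or -1 if the address is invalid or the socket cannot be created. */
static long flood_udp(const char *ip, uint16_t port,
                      const void *payload, size_t len, unsigned long count)
{
    struct sockaddr_in dst;
    unsigned long i;
    long sent = 0;
    int fd;

    assert(ip != NULL && payload != NULL);

    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(port);
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1)
        return -1;                    /* not a valid IPv4 address string */

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    for (i = 0; i < count; i++) {
        if (sendto(fd, payload, len, 0,
                   (const struct sockaddr *)&dst, sizeof dst) == (ssize_t)len)
            sent++;
    }

    close(fd);
    return sent;
}
```

Something like `flood_udp("192.168.1.10", 5000, "NOOP", 4, 100000)` would then send 100000 small datagrams to the FPGA; if the failure shows up within seconds instead of minutes, the packet count rather than elapsed time is the likely trigger.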
On Dec 31 2008, 6:57 pm, "MM" <mb...@yahoo.com> wrote:
> Hi Antti,
>
> Try disabling cache if it is enabled.
> Try increasing the stack.
>
> Also, take a look at the old GSRD reference design using MPMC and LL_TEMAC.
> It used to work quite reliably but it was long time ago since I tried it
> last time.
>
> /Mikhail

I-Cache is enabled, D-Cache is disabled, but I think the D-Cache invalidate calls are still made, so it's a good idea to remove them (or check that they are not called).

Antti
On Dec 31 2008, 11:14 pm, ga...@allegro.com (Gavin Scott) wrote:
> Antti <Antti.Luk...@googlemail.com> wrote:
> > PC sends very little amount of small UDP
> > packets that are responded by FPGA
>
> I'm wondering if greatly increasing the volume of packets going from
> the PC to the FPGA would make the problem reproduce faster, etc. Do
> you have the flexibility to change the PC side to increase or even
> flood it with status checks or some noop command?
>
> G.

It does seem to be related, yes: when the demo app is running on the PC, the failure takes longer to appear; when the real app is running, the failure seems to happen earlier. The real app sends more packets to the FPGA.

I have not tried flooding yet, but I have monitored the Rx/Tx buffer descriptor list fill level: when working, there is NEVER more than 1 incoming packet in the buffer chain, so there is no slow overflow of the buffer descriptor chain.

Antti
On Jan 2, 10:20 am, Antti <Antti.Luk...@googlemail.com> wrote:
> On Dec 31 2008, 11:14 pm, ga...@allegro.com (Gavin Scott) wrote:
> > Antti <Antti.Luk...@googlemail.com> wrote:
> > > PC sends very little amount of small UDP
> > > packets that are responded by FPGA
> >
> > I'm wondering if greatly increasing the volume of packets going from
> > the PC to the FPGA would make the problem reproduce faster, etc. Do
> > you have the flexibility to change the PC side to increase or even
> > flood it with status checks or some noop command?
> >
> > G.
>
> it seems to have relation yes, when demo app is running on PC
> the failure happens in longer time, when the real app is running
> failure seems to happen earlier. The real app sends more
> packets to FPGA
>
> I have not tried flooding yet, but i have monitored the Rx/Tx
> buffer descriptor list fill level, when working there is NEVER
> more than 1 incoming packet in the buffer chain
> so there is no slow overflow of the buffer descriptor chain
>
> Antti

I hope I have finally found the real issue...

A few days ago I had an "ISSUE LIST" in an Excel table where I noted the possible issues, their probability, methods of testing, etc.

The table had 26 items.

But one VERY important item was missing, something that should always be on the list:

"stupid software bug"

How could I have left it off my list? The original software was not written by me, nor is it very good, robust or well tested, but... it has been reported as working 100% on some occasions, so I assumed there was no systematic problem with it. (All assumptions are to be considered false.)

But the RX BD list is initialized once!!! ONCE!!

The software does not write the buflen any more after the initialization, so the BD list gets dirty and is never cleaned/released.

The DMA writes num_received into buflen (which was previously set to 2048); after that, this buflen is no longer modified, and neither is the stats field.

I truly hope this is the problem. If not, the next item to check on my list is jitter introduced by DCM chaining, causing some unexplained odd behaviour in the MPMC/DMA/DDR2... I hope it is not the DCM jitter problem.

Antti

PS: and if somebody thinks I should have seen it earlier: I compared some of the code with Xilinx example code, and there was no BD reinit there either, so I did not check deeper in the drivers. But the drivers can't do that part, so the code is really just missing.
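
To illustrate the kind of fix described above, here is a simplified sketch in C of re-arming RX buffer descriptors after each packet. The descriptor layout, the `BD_STS_DONE` bit and all function names are hypothetical (the real SDMA descriptor format differs, and the post's "stats field" is assumed here to be a status word); only the value 2048 comes from the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_BUF_LEN   2048u        /* initial buflen mentioned in the post */
#define BD_STS_DONE  0x80000000u  /* hypothetical "completed" status bit */

/* Hypothetical, simplified RX buffer descriptor. */
struct rx_bd {
    uint32_t next;     /* address of the next descriptor */
    uint32_t buf_addr; /* address of the receive buffer */
    uint32_t buflen;   /* in: buffer size; out: bytes received (DMA overwrites) */
    uint32_t status;   /* completion/status flags written by the DMA */
};

/* Hand one processed descriptor back to the DMA engine:
 * restore the full buffer length and clear the status flags. */
static void rx_bd_recycle(volatile struct rx_bd *bd)
{
    assert(bd != NULL);
    bd->buflen = RX_BUF_LEN;
    bd->status = 0;
}

/* Scan a ring of n descriptors, pass each completed packet to 'handle'
 * (may be NULL), then recycle the descriptor.
 * Returns the number of descriptors handled and recycled. */
static size_t rx_ring_service(volatile struct rx_bd *ring, size_t n,
                              void (*handle)(uint32_t buf_addr, uint32_t len))
{
    size_t i, done = 0;

    for (i = 0; i < n; i++) {
        if (ring[i].status & BD_STS_DONE) {
            if (handle)
                handle(ring[i].buf_addr, ring[i].buflen);
            rx_bd_recycle(&ring[i]);
            done++;
        }
    }
    return done;
}
```

A real driver would walk the ring in order from a head index rather than scanning it, and would likely need cache maintenance or a memory barrier plus a write to the DMA's tail/current pointer after recycling; those steps depend on the core and are left out here.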

>a few days ago i had a "ISSUE LIST" in excel table
>where i note the possible issues, their probability, methods of
>testing, etc..
>
>the table had 26 items.
>
>but one VERY important item was missing,
>something that should always be on the list:
>
>"stupid software bug"

Don't overlook the smart software bugs.

Many years ago, as a project was wrapping up, I made a list of the places where a bug could come from. I wish I had saved a copy.

The list included
  bugs in microcode
  bugs in microcode assembler
  bugs in data sheet

The one that I would have missed if I hadn't done it:
  bugs in my reading of a datasheet

--
These are my opinions, not necessarily my employer's. I hate spam.