Cache

Little notes on things that don't really deserve a long writeup.

Weird unsigned/int bug with divide in VivadoHLS

Reported on the Xilinx developer forum. Unfortunately, no Xilinx developer or forum administrator has confirmed the issue so far, but the minimal example shows how the bug can be reproduced. You can download the minimal example by clicking here.

ap_uint/ap_int behaviour in VivadoHLS

A small program that demonstrates how the ap_int and ap_uint HLS data-types behave when they are mixed in an accumulation:

        #include <stdio.h>
        #include "ap_int.h"
        
        int main()
        {
            ap_int<8> acc = 12;
            ap_uint<8> res = 0;
        
            res += acc;
            printf("res = %d\n",(int)res);
        
            acc = -6;
            res += acc;
            printf("res = %d\n",(int)res);
        
            acc = -8;
            res += acc;
            printf("res = %d\n",(int)res);
        
            return 0;
        }
      

Output result is:

        res = 12
        res = 6
        res = 254
      

Conclusion: The accumulation happens correctly in signed arithmetic, but the result is stored and printed as an unsigned integer, so 6 + (-8) = -2 wraps around to 254 in 8 bits.
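
The same wrap-around can be reproduced outside HLS. The following is a minimal Python sketch, assuming that storing a value into an ap_uint<8> simply keeps its low 8 bits; the helper name to_uint8 is made up for illustration and is not part of any Xilinx library:

```python
def to_uint8(value):
    """Keep only the low 8 bits, mimicking a store into an ap_uint<8>."""
    return value & 0xFF


res = 0
outputs = []
for acc in (12, -6, -8):
    # The addition itself is signed; only the stored result is unsigned.
    res = to_uint8(res + acc)
    outputs.append(res)
    print("res =", res)
```

The last step computes 6 + (-8) = -2, and -2 masked to 8 bits is 254, matching the HLS output above.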

Taskwarrior

Wow, this was hard to set up. Some steps are not documented, especially for allowing a TW client on a different machine to sync with the server (e.g. a server on the VPN):

  1. On your taskwarrior server: sudo ufw allow PORT_NUMBER_YOU_WANT_OPEN_FOR_TW
  2. On your taskwarrior server: add a line to /etc/hosts that explicitly maps your IP address to your hostname (e.g. 10.123.456.789 YOUR_HOSTNAME).
  3. Not entirely sure if this is needed, but this command was the first one that got any response from my TW server (my PC) on my client (my laptop): ssh -L TW_PORT:localhost:22 USERNAME@IP_ADDRESS, where TW_PORT is the port I am using for the TW server, and USERNAME/IP_ADDRESS are my login details for the TW server machine. After that, running task rc.taskd.server:localhost:localport somehow returned the tasks on my remote server. This didn't really solve the problem, though: I had to do the two steps above before I was able to run task sync without errors.

Regex back-referencing

Very useful for speeding up tedious tasks. For example, creating Verilog testbenches requires lots of wire declarations and port connections when hooking up the interface of the DUT: given module Top ( a, b, c ), the testbench has to instantiate Top DUT ( .a(a), .b(b), .c(c) ), which can be generated by a single well-constructed regex command. In vim-regex, you capture a matched string with \( \) and back-reference it with \1, e.g. %s/\(YOUR_REGEX_TO_MATCH\)/\.\1(\1)/g. Note that %s operates on the whole file; to restrict the substitution to a visual selection, select the lines and use s/ instead (vim fills in the '<,'> range for you).

Another example: on a line containing "James Bond", s/\(.*\) \(.*\)/I'm \2, \1 \2/g produces "I'm Bond, James Bond".
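
Back-references work the same way outside vim. As a rough sketch, here are both examples using Python's re module, where groups are written ( ) rather than \( \); the variable names are only for illustration:

```python
import re

# Port list of the module from the example: module Top ( a, b, c )
ports = "a, b, c"
# (\w+) captures each port name; \1 re-inserts it twice in the replacement.
connections = re.sub(r"(\w+)", r".\1(\1)", ports)
print(connections)

name = "James Bond"
# \1 is the first name, \2 is the last name.
intro = re.sub(r"(.*) (.*)", r"I'm \2, \1 \2", name)
print(intro)
```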

Dotfiles management

My dotfiles: https://github.com/sidmontu/dotfiles.

I followed this tutorial: https://www.atlassian.com/git/tutorials/dotfiles

On a new system, inside your home directory, run:

  1. echo ".cfg" >> .gitignore
  2. git clone --bare https://github.com/sidmontu/dotfiles.git $HOME/.cfg
  3. alias config='/usr/bin/git --git-dir=$HOME/.cfg/ --work-tree=$HOME'
  4. config checkout

UG documentation for Vivado Tools / Related

PYNQ -- Holy grail of the PYNQ software framework.

UG902 -- High-level synthesis (details about HLS data-types, IP cores, pragmas, more)

WP509 -- RF data converter IP block, and understanding the key parameters it exposes to the programmer.

SDSoC tutorials

DS926 -- RFSoC data sheet

PG269 -- RF data converter IP data sheet

UG1270 -- VivadoHLS optimization guide (pragmas and their effects mainly).

VivadoHLS pragma list -- HTML page of HLS pragmas, easier to read and navigate than the PDFs

PG085 -- AXI4 stream protocol IP suite for Xilinx.

RFSoC -- getting started guide (setup, tests, etc)

Collection of little oddities in FPGA tools

If you want to use the Xilinx FFT IP core in VivadoHLS and repeatedly call the HLS core function on different vectors of data (not unrolled, since you might not want to instantiate many copies of the FFT core in hardware), the HLS tool reports an incompatibility error on the interface data-type (ap_auto vs ap_fifo). To solve that, insert an HLS interface pragma that declares the input ports of the FFT IP core as type ap_fifo. See: https://forums.xilinx.com/t5/Vivado-High-Level-Synthesis-HLS/ap-auto-incompatible-with-ap-fifo-error-when-using-FFT-IP/td-p/703730.

In VivadoHLS, if you wish to reinterpret an ap_uint/ap_int value as a fixed-point data-type, i.e. you want the ap_fixed number to hold the bit-exact value of the ap_uint/ap_int number, you can assign through the .V member. E.g. with ap_fixed<8,6> fxp_num (a Q6.2 fixed-point number) and ap_uint<8> uint_num = 0x0f, setting "fxp_num.V = uint_num" sets the bits of fxp_num to 0x0f (binary 000011.11), which is interpreted as 3.75. See: https://forums.xilinx.com/t5/Vivado-TCL-Community/A-question-about-Type-Conversion-in-Vivado-HLS/td-p/711413
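
To double-check the 3.75 figure, here is a small Python sketch of the bit reinterpretation. It assumes a signed two's-complement fixed-point format with 8 total bits and 6 integer bits; the function name bits_to_fixed is hypothetical and not an HLS API:

```python
def bits_to_fixed(raw, total_bits=8, int_bits=6):
    """Interpret raw bits as a signed (two's complement) fixed-point value."""
    frac_bits = total_bits - int_bits
    raw &= (1 << total_bits) - 1          # keep only total_bits bits
    if raw & (1 << (total_bits - 1)):     # sign bit set, so the value is negative
        raw -= 1 << total_bits
    return raw / (1 << frac_bits)         # scale by 2^-frac_bits


value = bits_to_fixed(0x0F)
print(value)
```

Here 0x0f is 15 as an integer, and with 2 fractional bits the value is 15 / 4 = 3.75.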

There is no Clock-Domain-Crossing (CDC) support in VivadoHLS, which is weird, since you can specify a separate clock for your s_axilite ports (e.g. #pragma HLS interface port=axilite_port clock=config_clk). Either it's broken, or I need to investigate further whether the timing errors reported on the path can actually be ignored as false paths. See https://forums.xilinx.com/t5/Vivado-High-Level-Synthesis-HLS/no-clock-domain-crossing-in-HLS/td-p/933543.

How to connect interrupts in block design (or how not to): https://forums.xilinx.com/t5/Embedded-Linux/A-key-error-when-I-am-trying-to-access-a-DMA-IP-on-PYNQ-Z1-board/td-p/847817

For some obscure reason, if Xilinx tools (e.g. Vivado) are in your $PATH, some C/C++ projects report build errors. I observed this while trying to build the Ettus RFNoC framework using PyBOMBS. Removing the Xilinx tools from the $PATH in a fresh terminal does the trick.

The Pynq Jupyter environment sometimes fails to load even the basic pynq overlay modules. If you see a "ValueError: bad marshal data", then perhaps there are corrupt compiled Python module files (.pyc). I fixed this once by running "(sudo) find /usr -name '*.pyc' -delete". Stackoverflow link.


Python __slots__

Interesting little feature in Python: instead of giving every instance a dynamic __dict__ of attributes, __slots__ statically declares the attribute names your class instances will have. The attribute values can still change, but no new attributes can be added at runtime. The effect is better runtime performance (faster attribute lookup) and savings in memory usage. See: Stackoverflow discussion
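
A minimal sketch of what that looks like in practice (the Point class is just an illustration):

```python
class Point:
    __slots__ = ("x", "y")  # only these attribute names may exist on instances

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)
p.x = 10                           # declared slots can still be reassigned
has_dict = hasattr(p, "__dict__")  # slotted instances carry no per-instance dict

try:
    p.z = 3                        # "z" is not declared in __slots__
    rejected = False
except AttributeError:
    rejected = True

print(has_dict, rejected)
```

The memory saving comes from not allocating a per-instance __dict__, so it matters most when you create very many small objects.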

Very useful guide on choosing a loss function in Tensorflow

Stackoverflow link

Tensorpack custom DataFlow class

If you have a very large dataset that is dynamically loaded from disk, i.e. you keep track of pointers in your get_data() routine to yield data to the trainer at runtime, remember to reset these pointers appropriately at the beginning of the get_data() routine. This is because Tensorpack spawns multiple threads for feeding the trainer at runtime, and without such a reset (e.g. in a "reset_state()" function) the stale pointers cause out-of-range memory access errors.
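
The pointer-reset pattern can be sketched in plain Python without Tensorpack installed. This is only an illustration under assumptions: the class below does not subclass Tensorpack's DataFlow (a real one would), the name ChunkedDataFlow is made up, and an in-memory list stands in for the on-disk dataset:

```python
class ChunkedDataFlow:
    """Plain-Python stand-in for a custom DataFlow that tracks a read pointer."""

    def __init__(self, samples, chunk_size):
        self.samples = samples        # stand-in for data loaded lazily from disk
        self.chunk_size = chunk_size
        self.pos = 0                  # read pointer into the dataset

    def reset_state(self):
        self.pos = 0

    def get_data(self):
        self.reset_state()            # start every pass from the beginning
        while self.pos < len(self.samples):
            yield self.samples[self.pos:self.pos + self.chunk_size]
            self.pos += self.chunk_size


flow = ChunkedDataFlow(list(range(5)), chunk_size=2)
first_pass = list(flow.get_data())
second_pass = list(flow.get_data())
print(first_pass, second_pass)
```

Without the reset at the top of get_data(), the second pass would start with the pointer already at the end of the data and yield nothing.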

Xilinx pblocks

A pblock is a physical block that constrains cells (logic) to a physical area of the FPGA. Pblocks can be nested, but Xilinx recommends at most one level of nesting, and avoiding it altogether if possible. A pblock can be given a custom shape (instead of just a rectangle) by clicking on the "add pblock rectangle" option after selecting the pblock you wish to reshape. The pblock properties can be exported to an xdc constraints file automatically using the "save constraints as" option under "File". These constraints are basically tcl commands that look something like the following:

create_pblock "name you wish to give to the pblock"
add_cells_to_pblock [get_pblocks "name of pblock"] [get_cells -quiet [list "name of module you wish to assign to pblock"]]
resize_pblock [get_pblocks "name of pblock"] -add {SLICE_"XY-COORD":SLICE_"XY-COORD"}
resize_pblock [get_pblocks "name of pblock"] -add {DSP48_"XY-COORD":DSP48_"XY-COORD"}
resize_pblock [get_pblocks "name of pblock"] -add {RAMB18_"XY-COORD":RAMB18_"XY-COORD"}
resize_pblock [get_pblocks "name of pblock"] -add {RAMB36_"XY-COORD":RAMB36_"XY-COORD"}
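
To make the template concrete, here is a small Python sketch that fills it in and prints the resulting constraint lines. All names in it (pblock_fft, top/fft_inst and the site coordinates) are made-up placeholders, not values from a real design:

```python
def pblock_constraints(name, cell, site_ranges):
    """Build xdc/tcl lines for one pblock from (first_site, last_site) pairs."""
    lines = [
        f"create_pblock {name}",
        f"add_cells_to_pblock [get_pblocks {name}] [get_cells -quiet [list {cell}]]",
    ]
    for first, last in site_ranges:
        # Doubled braces produce the literal { } that tcl expects.
        lines.append(f"resize_pblock [get_pblocks {name}] -add {{{first}:{last}}}")
    return lines


lines = pblock_constraints(
    "pblock_fft",                        # hypothetical pblock name
    "top/fft_inst",                      # hypothetical cell to constrain
    [("SLICE_X0Y0", "SLICE_X19Y49"),     # hypothetical site ranges
     ("DSP48_X0Y0", "DSP48_X1Y19")],
)
print("\n".join(lines))
```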

You can run a DRC check for floorplanning to make sure the pblock specified for the module meets expectations. Basically, this is a sanity test.

Vivado has an auto-floorplanning tool as well (under Tools - Floorplanning - Auto-create Pblocks). Also worth checking out is the "Window - Physical Constraints" tab, which shows the connectivity between different pblocks for better visualization of the dataflow in your circuit. Individual logic can also be locked into exact locations for consistency between runs (otherwise you may get a completely different placed/routed solution each time, and cannot tell whether your changes made any difference between implementation runs).

Floorplanning can improve performance and consistency between runs. Some guidelines: floorplan only what is necessary, and do not over-floorplan. Choose modules to floorplan that are not connected to lots of other modules; highly connected modules are better left free to be broken apart, which reduces critical path lengths. Use the RTL hierarchy, DRC checks, properties and other Vivado tools to judge what to floorplan and how. Finally, floorplanning can be an iterative process.

Reference resource: https://www.xilinx.com/video/hardware/design-analysis-floorplanning-with-vivado.html

Arch Linux + bspwm + sxhkd

A weird thing I found: if you use termite instead of urxvt as your main terminal, there is an idiosyncrasy in how a bspc rule has to be declared. In fact, judging from a Stack Overflow post I saw, urxvt has the same oddity.

Instead of:

bspc rule -a termite state=floating
bspc rule -a urxvt state=floating

You must declare the following (the only difference is the capitalization, presumably because the rule is matched against the window's case-sensitive class name):

bspc rule -a Termite state=floating
bspc rule -a URxvt state=floating

for the rule to register when creating the window. By the way, what's with the weird names (sxhkd, urxvt)?

Problems: panel does not launch automatically on startx.

Custom IP on Pynq

Tested with Vivado(HLS) 2017.4

For reference: https://www.youtube.com/watch?v=Dupyek4NUoI

For reference: Pynq-Z1 board part number is xc7z020clg400-1

For reference: Pynq-Z1 board files from: https://github.com/cathalmccabe/pynq-z1_board_files (installation instructions in the .md file in repo)

Steps (roughly) :

1) The easiest way is to start with VivadoHLS and write your own HLS module that interfaces with the host CPU. Add HLS INTERFACE pragmas to define the I/O ports and their types, e.g. #pragma HLS INTERFACE s_axilite port=. You can choose m_axi, s_axilite or axis for the port type (as far as I know for now). Also add #pragma HLS INTERFACE ap_ctrl_none port=return to turn off the function-call handshake, which otherwise creates a whole bunch of extra ports such as ap_start, ap_done, ap_idle and ap_ready that might not be required. See https://www.xilinx.com/support/answers/55279.html for more info.
2) Run C synthesis, make sure there are no errors and you understand the warnings.
3) Click the "Export RTL" button. The default settings are fine, so just click OK. This will package your design as an IP. Choose the language you prefer as well (I mostly go with Verilog).
4) This is the first interesting part to pay attention to: under the "solution1" folder (assuming you're using the default solution name for the Vivado synthesis process), go to impl>misc>drivers>"your ip name">src and open x"ip name"_hw.h. Make a note of the address of each register that is assigned to your ports. You'll need these later to write the Python software drivers that send data to these memory-mapped AXI ports.
5) Switch over to Vivado.
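
For step 4, copying the register addresses by hand gets tedious for larger IPs. Below is a rough Python sketch that pulls them out of the header text with a regex. The excerpt is a hypothetical stand-in: the XMYIP prefix and the register names are made up, and the macro names in your real x"ip name"_hw.h will differ, so adjust the pattern to whatever the tool generated:

```python
import re

# Hypothetical excerpt standing in for the contents of the generated header.
header_text = """
#define XMYIP_AXILITES_ADDR_A_DATA 0x10
#define XMYIP_AXILITES_BITS_A_DATA 32
#define XMYIP_AXILITES_ADDR_B_DATA 0x18
"""

# Keep only the *_ADDR_* macros that are followed by a hex offset.
pattern = re.compile(r"#define\s+(\w+_ADDR_\w+)\s+(0x[0-9a-fA-F]+)")
addresses = {name: int(value, 16) for name, value in pattern.findall(header_text)}

for name, offset in addresses.items():
    print(f"{name}: {offset:#04x}")
```

The resulting dictionary gives the offsets that the Python driver later has to write to.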

AXI-STREAM interface on Pynq (with DMA)

Obscure artifact #1: Interrupts from the AXI DMA block have to be handled by an AXI Interrupt Controller, instead of simply being concatenated into the IRQ_F2P port of the Zynq PS. See https://forums.xilinx.com/t5/Embedded-Linux/A-key-error-when-I-am-trying-to-access-a-DMA-IP-on-PYNQ-Z1-board/td-p/847817

DFT/FFT footnotes

Mostly grabbed from a great semi-technical summary in a reddit comment, Source: https://www.reddit.com/r/explainlikeimfive/comments/9cbi8p/eli5_what_is_the_fast_fourier_transform_or_fft/e5axg1n/

Tensorflow visualizing the graph

https://www.tensorflow.org/guide/graphs#visualizing_your_graph

To see the graph of the model you're training, run tensorboard --logdir followed by your log directory, and open it in a browser at port 6006 (e.g. localhost:6006).

Compilers

From: http://venge.net/graydon/talks/CompilerTalk-2019.pdf

Frances Allen "A Catalogue of Optimizing Transformations" (1971 paper). Write 8 passes to get ~80% best-case performance (used by many modern compilers). They are:

Support vector machines

Notes from Caltech lecture on SVMs: https://www.youtube.com/watch?v=eHsErlPJWUU

Kernel Methods

Notes from Caltech lecture on kernel methods: https://www.youtube.com/watch?v=XUj5JbQihlU