Notes and Updates

June 28th, 2019

Goals of the Project:

  • Have an FPGA do an on-demand image convolution when the PC tells it to                     
  • Build a convolution hardware framework in Verilog, then use low-level C to have the PC send commands and image data to the FPGA
  • Explore the implementation of neural networks in FPGAs
    • Map out resource usage, latency for deep neural network architectures and hyperparameters
      • Latency refers to the total time (typically expressed in units of “clocks”) required for a single iteration of the algorithm to complete
    • Demonstrate deep learning techniques in FPGA applications

Initial Planning: 

  • Hardware diagram:
    • System, top-level blocks are:
      • PC & FPGA, connected via USB, “serial over USB”
    • FPGA, top-level blocks might be:
      • Usb_interface
      • Stream_parser
      • Line_buffer
      • Image_data_fetcher
      • Convo_unit
      • Output_formatter
  • Software diagrams:
    • Windows application
      • Shows input data, processing (decompression of images, reading convolutional weights from a file, etc.), and communication via serial read/write
      • File formats (e.g. how weights are accessed)
      • Key data structures
      • Communications protocol for serial
  • Verification:
    • Bit-accurate emulation of hardware processor in C/C++, Java or Python, MATLAB, Simulink, etc.
    • Verilog testbenches for various hardware modules and subsystems (groups of modules) and end-to-end FPGA design verification
    • System verification plan – what data will you use, how will you know if you are successful?

Who will do what? 

  • Who will own PC/Windows software?
    • Paul
  • Who will own software emulator of hardware?
    • Paul with help from Hussain
  • Who is in charge of the data transfer protocol (PC to FPGA and vice versa)?
    • Ryan
  • Who will own the Verilog architecture – specification of all top-level blocks (name, pinouts, behavior), and protocol-accurate specification of all interfaces between all top-level blocks?
    • Hussain
  • Who will own design of each top-level Verilog block?
    • Hussain and Ryan
  • Who will own project management (create and track schedule of tasks and milestones)?
    • Kiera
  • Who will own the website creation and maintenance?
    • Kiera
  • Who will take notes from each team meeting (including weekly with me), and publish notes on website?
    • Kiera
  • Who will manage Git repo, configurations, releases?
    • Hussain

July 7th, 2019

  • USB/Ethernet
    • Should we stick to USB or would it be worth researching Ethernet as a potential transmission medium?
  • What is our project going to be?
    • Send convolution operations through a C-level program, send image data, communicate through USB
      • How much memory do we have on an FPGA?
    • PC keeps track of data being transmitted, FPGA just runs until it runs out of data
    • What modules do we need?
      • PC side: host program, written in C, in a Linux environment
      • Data transfer out of and onto the FPGA, PC image parts 
      • Convolution reconstructor?

GitHub repository created: contains the pseudocode describing the functionality of a Host-FPGA Convolutional Accelerator

July 28th, 2019

Transmission medium should be USB/UART

Total RAM: 500 kilobytes

Requirements (for your customer)

  • Image
    • 512x512x3 input initial
    • “Same” setting initial, “valid” setting stretch
  • Kernel
    • 7x7x3 size initial
    • Coefficients are Q0.15 fixed point
  • Output
    • 1 feature map initial, n feature maps stretch
  • UI
    • fileIO user program initial, API stretch
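
The Q0.15 coefficient format listed under Kernel can be illustrated with a small sketch. It assumes a 16-bit signed word (one sign bit, 15 fractional bits), round-to-nearest, and saturation at the ends of the range. The function names are illustrative and are not taken from the project code.

```python
Q15_SCALE = 1 << 15          # 15 fractional bits
Q15_MIN = -(1 << 15)         # -32768 represents -1.0
Q15_MAX = (1 << 15) - 1      # 32767 represents just under +1.0


def float_to_q15(x):
    """Quantize a real coefficient to a signed Q0.15 integer (assumed rounding and saturation)."""
    raw = int(round(x * Q15_SCALE))
    return max(Q15_MIN, min(Q15_MAX, raw))


def q15_to_float(q):
    """Convert a Q0.15 integer back to its real value."""
    return q / Q15_SCALE
```

Note that +1.0 itself is not representable under these assumptions, so it saturates to 32767.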

Specifications:

  • Specify hardware, clock rate, baud rate?
  • USB and serial
  • Pixels are sent byte by byte (e.g. R0, G0, B0)

August 29th, 2019

  • Started working on C-based image parser, breaking the image down, streaming image, pseudocode for that
    • POST PSEUDOCODE
  • How do you drive a UART controller from a C program?
  • Linux based; tty logical device
  • Break up image, get raw bytes, distribute those
  • Using GCC

*Hardware complete by the end of the semester

Suggestions from Dr. Pearlstein:

  • Use 16 bit number for quantization and noise
  • Forces you to have 4 leading 0’s which you won’t represent… people use numbers greater than 1
  • What is the dynamic range? …. you can always scale it finding min and max, preprocess and normalize
    • Won’t scale on FPGA
    • Avoid saturation to the best of our abilities
    • Dynamic allocation
  • Chunky vs Planar?
    • Chunky
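
As a reminder of what the chunky choice means: chunky (interleaved) data keeps the R, G, and B bytes of each pixel together, while planar data stores each channel as its own block. The sketch below converts between the two, assuming 8-bit channels and RGB ordering; the function names are made up for illustration.

```python
def planar_to_chunky(r, g, b):
    """Interleave three equal-length channel planes into R0 G0 B0 R1 G1 B1 ... order."""
    if not (len(r) == len(g) == len(b)):
        raise ValueError("channel planes must have equal length")
    out = bytearray()
    for rv, gv, bv in zip(r, g, b):
        out += bytes((rv, gv, bv))
    return bytes(out)


def chunky_to_planar(data):
    """Split interleaved RGB bytes back into three channel planes."""
    if len(data) % 3:
        raise ValueError("chunky RGB data length must be a multiple of 3")
    return bytes(data[0::3]), bytes(data[1::3]), bytes(data[2::3])
```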

September 2nd, 2019

  • Changing image size to 512x512x3 
    • This is what most conv nets use, not bottlenecked by memory
  • Maintain Kernel size of 7×7

September 4th, 2019

  • Uart block gets us data, with a FIFO

High Level Requirements:

  • What does the FPGA do?
    • Reads data from the PC
    • Stores kernel data
    • Stores part of the image
    • Compute sum of products (SOP) of each image part with the kernel
    • Send result to PC
  • What does the PC do?
    • Load user image
    • Send image to FPGA
    • Receive end data from FPGA
    • Present final output to user
  • Test Plan
    • Send in a 512x512x3 image (the data), and validate the output 
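
To validate the FPGA output, a software reference for the sum of products is useful. The sketch below is one possible reference model, not the project's emulator. It assumes zero padding for the “same” setting, no kernel flip (the usual CNN convention), and a single output feature map.

```python
def conv2d_same(image, kernel):
    """image[y][x][c], kernel[ky][kx][c] -> out[y][x], zero-padded so the output matches the input size."""
    height, width = len(image), len(image[0])
    k = len(kernel)
    channels = len(kernel[0][0])
    pad = k // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0
            for ky in range(k):
                for kx in range(k):
                    iy, ix = y + ky - pad, x + kx - pad
                    # Pixels outside the image count as zero.
                    if 0 <= iy < height and 0 <= ix < width:
                        for c in range(channels):
                            acc += image[iy][ix][c] * kernel[ky][kx][c]
            out[y][x] = acc
    return out
```

With integer pixels and integer (already quantized) coefficients, the result is exact and can be compared value by value against the hardware output.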

Bottleneck: UART

September 5th, 2019

Block Diagrams/Architecture

Level 0: High-level

Level 1: Mid-level

PC:

N memories, n-1 memories (n-1 lines)

FPGA:

September 12th, 2019

  • Top level diagram
  • Drawing that shows connectivity 
  • Create standard naming convention
  • Prototypes for blocks by next week

September 19th, 2019

  • Successful implementation of a FIFO for both the sending and receiving of the UART module 
  • Simulations looked promising; a bitstream was created to test the design in hardware
  • Implemented Verilog functionality for the image window and sum-of-products multipliers
  • Working log in Google Sheets was created and updated

    To be completed for next week: 
  • Module fully tested in hardware to determine the maximum baud rate for the project
  • Begin working on the RTL for the line buffer using BRAM blocks
  • Implementing the controller FSM and starting to test each module
  • Figure out the fixed point scaling during the kernel/pixel multiplication
  • IO domain

September 26th, 2019

  • Focus on editing the WordPress website and look into add-on features the school could offer; focus less on investing in a .io domain
    Image Window/Test Benches: 
  • Write the last two test benches; the image is unsigned 8-bit, while CNN values are signed (except the input, which is unsigned)
  • RTR/RTS hardware code complete; tested the shift window with a high-level test bench; make sure the shift window meets timing
    UART:
  • Stop bit, full bit long, check sender, framing errors, start-stop bit, produce null character
  • Figure out how to enable serial flow control, 2 way flow control
  • Try to flow control the modem (FPGA)
  • Configure the chip
  • Look into parsing the CSV file in Python, then write it in C for speed
  • Front-load progress, aim to get OpenCV, run storage blocks
    Other:
  • Add architecture document to the Gantt chart
  • Action items with owner and due date
  • Specify who works on the architecture document; get it done by this time next week
  • 1st draft of test plan on FPGA: Kiera and Hussain
  • 1st draft of test plan for UART: Ryan and Paul

October 3rd, 2019

Current Status: Behind

  • RTS: request to send, transmit side
  • More block diagrams and images
  • BRAM block complete, work on testing this
  • Potentially create timing diagrams, hand draw cycles
  • Not using much memory
  • Need to make more progress, particularly on the software side
  • Block memory generator
  • Working on the fixed point multiplier for the convolution unit
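
For the fixed point multiplier, one question is how wide the accumulator must be. The sketch below works out a worst-case bound, assuming the 7x7x3 kernel, unsigned 8-bit pixels, and Q0.15 coefficients from the requirements, and then shows a multiply-accumulate that drops the 15 fractional bits at the end. Truncation by arithmetic shift is an assumption here; the hardware may round differently.

```python
KERNEL_TAPS = 7 * 7 * 3        # 147 products per output point
MAX_PIXEL = 255                # unsigned 8-bit input
MAX_COEFF_MAG = 1 << 15        # largest Q0.15 magnitude (from -32768)


def accumulator_bits(taps=KERNEL_TAPS, max_pixel=MAX_PIXEL, max_coeff=MAX_COEFF_MAG):
    """Signed bits needed so the worst-case sum of products cannot overflow."""
    worst = taps * max_pixel * max_coeff
    return worst.bit_length() + 1   # +1 for the sign bit


def sop_q15(pixels, coeffs_q15):
    """Multiply-accumulate at full precision, then drop the 15 fractional bits."""
    acc = sum(p * c for p, c in zip(pixels, coeffs_q15))
    return acc >> 15   # arithmetic shift; rounds toward negative infinity
```

Under these assumptions a 32-bit signed accumulator is enough for one output point, so saturation can only occur when the result is narrowed afterwards.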

October 17th, 2019

  • Figure out the protocol together
  • Planned approach:
    • BRAM is working
    • Ran one simulation, BRAM has been filled, writing and reading data at the same time, read and then write on the same cycle?
    • Dual port? Assume reading and writing are not occurring simultaneously

  • Is cycle rate faster than pixel rate?
  • Limited by the rate of multipliers
  • CNN

  • Unified state flow
  • UART values and determine the protocol
  • Discussed the interfacing
    • Multiple interfaces need to be specified (but might not be necessary)
    • Always ready to receive, read then set rtr low once value is written to the BRAM
  • State Image: Image to image start up, build up window before you can output any points

What needs to be done:

  • How the start-up works
  • Condition the start up behavior
  • Pull out garbage until you get the data you want
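
The start-up behavior can be modeled in software before it is written in RTL. The sketch below assumes pixels arrive in raster order and keeps k-1 full lines plus k pixels of history, producing no output until the window holds only real data. It uses one value per pixel for brevity, and the function name is hypothetical.

```python
from collections import deque


def stream_windows(pixels, width, k):
    """Yield (row, col, window) for each fully valid k x k window of a raster-order pixel stream."""
    history = deque(maxlen=(k - 1) * width + k)   # k-1 full lines plus k pixels
    for index, value in enumerate(pixels):
        history.append(value)
        row, col = divmod(index, width)
        # Before this point the window would contain start-up garbage, so nothing is output.
        if row >= k - 1 and col >= k - 1:
            last = len(history) - 1
            window = [
                [history[last - (k - 1 - wy) * width - (k - 1 - wx)] for wx in range(k)]
                for wy in range(k)
            ]
            yield row, col, window
```

In this model the first output appears only after (k-1)*width + k pixels have been received, which is the delay the start-up logic has to account for.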

Start up behavior — Hussain

Protocol — Paul

Architecture

  • ILB, pixel rate, iron out state flow of SOPU (Hussain will meet with Dr. Pearlstein)
  • UART, website (Mr. Lee), fit in work on preprocessors, get the protocols done
  • Hardware set up to send, it, find results live and script that in a few different 

Serial test bench model

  • Write the Verilog code to read the image and kernel, and send the image and kernel data through Verilog
  • Verilog will read a CSV file containing the kernel and image (stored as raw bytes, RGB RGB); use the XnView converter to convert the image to a raw file

*Write down the test plan

October 24th, 2019

  • Write program to read data from the stream
    • fread to read in the entire file into the array
    • convert text file
    • 512×512
    • MATLAB can read in image files with imread
    • Take the array that you read and convert it to a CSV file
    • flatten the array ; ) — converts any number 
  • Black white pixels, all 0’s, all 255’s, simple patterns
  • Architecture Document has been updated
    • Now, assume existence of sign bit and use a 9 bit system (even though it is an 8 bit system) c1.0.7
    • Add a picture with blocks and describe it
  • The SOPU will be sending data, interaction between the two images
  • Determine the max rate we can send (make sure to test this) 
  • Autobaud can detect
  • Make a gif of the window movement –Kiera
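
The read-and-convert steps above (read the raw file into an array, then write it out as CSV) could look roughly like the sketch below. It assumes headerless raw data with 8-bit chunky RGB samples and writes one image row per CSV line; the function name and the row layout are illustrative choices.

```python
import csv
import io


def raw_rgb_to_csv(data, width, channels=3):
    """Return CSV text with one image row per line, values in R,G,B,R,G,B,... order."""
    row_bytes = width * channels
    if row_bytes == 0 or len(data) % row_bytes:
        raise ValueError("data length is not a whole number of image rows")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for start in range(0, len(data), row_bytes):
        # Iterating over bytes yields integers 0..255, which csv writes as decimal text.
        writer.writerow(list(data[start:start + row_bytes]))
    return buffer.getvalue()
```

The raw bytes would come from reading the converted image file in binary mode. The simple test patterns (all 0’s, all 255’s) make the resulting CSV easy to check by eye.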

October 31st, 2019

  • Update architecture document
  • Implement Ryan’s new design and test with Hussain
  • Serial port is working
  • Line buffer synchronization
  • Must get the interface to work
    • (Try to make enhancements)
  • Create more efficient test benches
    • Issue pass/fail
    • Create random data to test
    • Self-checking test

“Automated environment” 
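
A self-checking test with random data and a pass/fail result could be structured like the sketch below. Here `dut` and `reference` are placeholders for any two callables that should agree (for example a hardware model and a golden model); they are not names from the project.

```python
import random


def run_self_check(dut, reference, trials=100, length=16, seed=0):
    """Feed identical random byte vectors to both models and report the first mismatch, if any."""
    rng = random.Random(seed)          # fixed seed keeps failures reproducible
    for trial in range(trials):
        stimulus = [rng.randrange(256) for _ in range(length)]
        expected = reference(stimulus)
        actual = dut(stimulus)
        if actual != expected:
            return "FAIL", trial, stimulus
    return "PASS", trials, None
```

Returning the failing stimulus makes it possible to replay the exact vector in a waveform simulation.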

Communication to and from the software

November 7th, 2019

  • Continue working on graphics for the website

  • Issues
    • No physical hardware issue, as of right now
    • Not working on the terminal
  • 9600 baud
  • Windows VM
  • Ryan made a Java program for random values
    • Test bench to implement
    • Read in values and compare against output values
  • Hussain: wrote all of the RTL 
    • Interfacing between the UART and ILB (image line buffer), how data lines up, data distributed approach
      • Polls the UART, ready to read, latches the byte, master reads the new byte, fills the kernel, when
      • Gets a pixel from the UART, send 
      • Interface: ILB, immediately take in the data as it comes
    • One to read and one to write to
    • Run test synthesis

November 14th, 2019

Self check is finished, random number generator will be used to test

  • Error with ILB fixed, memory spaces were too big
  • SOPU is in the testing phase
  • Integration and unit testing

Work on the documentation

What speed to use for UART?

  • 9600, slow, stalls for new data but at least it works
  • Port to Windows, having Linux issues
  • Linux on FPGA, FTDI chip
  • Get numbers on FPGA utilization, figure out the resources required
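
To put the UART speed question in numbers, the sketch below estimates the time to send one image. It assumes 10 bits on the wire per byte (1 start, 8 data, 1 stop, no parity) and ignores protocol overhead and flow-control stalls, so real transfers would take longer.

```python
def uart_transfer_seconds(num_bytes, baud, bits_per_frame=10):
    """Time to move num_bytes over a UART, assuming 1 start + 8 data + 1 stop bit per byte."""
    return num_bytes * bits_per_frame / baud


IMAGE_BYTES = 512 * 512 * 3   # one byte per channel, from the requirements
```

Under these assumptions, `uart_transfer_seconds(IMAGE_BYTES, 9600)` comes to about 819 seconds, or roughly 13.7 minutes per 512x512x3 image, which is why the UART is the bottleneck.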

November 21st, 2019

  • Animations agreed upon, set to finish before senior project presentations
    • Paul is trying to get serial communication working, still Linux based
  • Ran the program he wrote; it would either return the character T or time out and return nothing, seemingly at random
    • Possible explanation: a non-blocking read returns immediately and should not time out, but it can return null
  • Need a way to block certain data
  • Check the results, do non-blocking reads and see how much valid data is present, reading raw bytes from a byte stream
  • Submit an abstract
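
The non-blocking behavior discussed above can be reproduced without hardware. The sketch below uses an OS pipe as a stand-in for the serial device (an assumption for illustration only; a real tty configured through termios may behave differently) and shows that a non-blocking read comes back immediately with nothing when no byte is waiting.

```python
import os


def read_available(fd, max_bytes=64):
    """Non-blocking read: return whatever bytes are ready, or b'' when none are."""
    try:
        return os.read(fd, max_bytes)
    except BlockingIOError:
        return b""


def demo():
    read_fd, write_fd = os.pipe()        # stand-in for the serial device
    os.set_blocking(read_fd, False)
    empty = read_available(read_fd)      # nothing written yet
    os.write(write_fd, b"T")
    ready = read_available(read_fd)
    os.close(read_fd)
    os.close(write_fd)
    return empty, ready
```

A host program built this way has to count how many valid bytes each read returned and retry, rather than treating an empty read as an error.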

January 27th, 2020

  • Develop a software model for the FPGA
  • UART protocol complete, can communicate with the PC
  • C++ has better file I/O than Verilog so this can be exploited
    • When you do arithmetic operations in Verilog the result will be limited in size
    • Many advantages to coding in C
  • Write a Tcl script for simulations for wave debugging
  • Log all signals from the very beginning
  • Hopefully verify and test, combine ILB and SOPU together to test
  • Post synthesis, post design
  • Using different clocks can result in timing violations, if you simulate the chip well enough you can find errors
    • Hassle to figure out what is wrong with the FPGA
  • Target for next week: C++ software model done; start hardware/software post-simulation and integrate those; hardware test the ILB; create new RTL for a direct connection that can read and write the values at the same time; test the hardware
  • Decoupled verification, make a plan
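
Because arithmetic results in Verilog are limited in size, a bit-accurate software model has to wrap values to the same widths. The helpers below are a minimal sketch of that idea, assuming two's-complement wraparound; the names are illustrative, and the actual widths would come from the RTL.

```python
def wrap_unsigned(value, bits):
    """Keep only the low `bits` bits, as a fixed-width hardware register would."""
    return value & ((1 << bits) - 1)


def wrap_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value = wrap_unsigned(value, bits)
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value
```

For example, an unsigned 8-bit pixel such as 200 stays positive when held in a 9-bit signed field, while the same bit pattern read as 8-bit signed would be negative.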