June 28th, 2019
Goals of the Project:
- Have an FPGA do an on-demand image convolution when the PC tells it to
- Build a convolution hardware framework in Verilog, then use low-level C to have the PC send commands and image data to the FPGA
- Explore the implementation of neural networks in FPGAs
- Map out resource usage, latency for deep neural network architectures and hyperparameters
- Latency refers to the total time (typically expressed in units of “clocks”) required for a single iteration of the algorithm to complete
- Demonstrate deep learning techniques in FPGA applications
Initial Planning:
- Hardware diagram:
- System, top-level blocks are:
- PC & FPGA, connected via USB, “serial over USB”
- FPGA, top-level blocks might be:
- Usb_interface
- Stream_parser
- Line_buffer
- Image_data_fetcher
- Convo_unit
- Output_formatter
- Software diagrams:
- Windows application
- Shows input data, processing (decompression of images, reading convolutional weights from a file, etc), communication via serial read/write
- File formats (e.g. how weights are accessed)
- Key data structures
- Communications protocol for serial
- Verification:
- Bit-accurate emulation of hardware processor in C/C++, Java or Python, MATLAB, Simulink, etc.
- Verilog testbenches for various hardware modules and subsystems (groups of modules) and end-to-end FPGA design verification
- System verification plan – what data will you use, how will you know if you are successful?
Who will do what?
- Who will own PC/Windows software?
- Paul
- Who will own software emulator of hardware?
- Paul with help from Hussain
- Who is in charge of the data transfer protocol (PC to FPGA and vice versa)?
- Ryan
- Who will own Verilog architecture – specification of all top-level blocks (name, pinouts, behavior), and protocol-accurate specification of all interfaces between all top-level blocks?
- Hussain
- Who will own design of each top-level Verilog block?
- Hussain and Ryan
- Who will own project management (create and track schedule of tasks and milestones)?
- Kiera
- Who will own the website creation and maintenance?
- Kiera
- Who will take notes from each team meeting (including weekly with me), and publish notes on website?
- Kiera
- Who will manage Git repo, configurations, releases?
- Hussain
July 7th, 2019
- USB/Ethernet
- Should we stick to USB or would it be worth researching Ethernet as a potential transmission medium?
- What is our project going to be?
- Send convolution operations through a C-level program, send image data, communicate through USB
- How much memory do we have on an FPGA?
- PC keeps track of data being transmitted, FPGA just runs until it runs out of data
- What modules do we need?
- PC side: host program, written in C, in a Linux environment
- Data transfer out of and onto the FPGA, PC image parts
- Convolution reconstructor?
GitHub repository created: contains the pseudocode describing the functionality for a Host-FPGA Convolutional Accelerator
July 28th, 2019
Transmission medium should be USB/UART
Total RAM: 500 kilobytes
Requirements (for your customer)
- Image
- 512x512x3 input initial
- “Same” setting initial, “valid” setting stretch
- Kernel
- 7x7x3 size initial
- Coefficients are Q0.15 fixed point
- Output
- 1 feature map initial, n feature maps stretch
- UI
- fileIO user program initial, API stretch
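A minimal software-model sketch of the initial requirements above (one feature map, “same” output size, Q0.15 coefficients), written in Python purely for illustration. The function names, the zero padding at the borders, and the choice to return the unscaled accumulator are assumptions, not the team's design. As in most CNN code, the kernel is applied without flipping.

```python
def to_q15(x):
    """Quantize a float to a signed 16-bit Q0.15 integer, saturating at the ends."""
    q = int(round(x * 32768))
    return max(-32768, min(32767, q))


def conv_same(image, kernel_q15):
    """Sum of products over a K x K x C window with zero padding ("same" size).

    image      : H x W x C nested lists of unsigned 8-bit pixel values
    kernel_q15 : K x K x C nested lists of Q0.15 integers
    Returns an H x W list of integer accumulators (still scaled by 2**15).
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    k = len(kernel_q15)
    pad = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for ky in range(k):
                iy = y + ky - pad
                if not 0 <= iy < h:
                    continue  # rows outside the image contribute zero
                for kx in range(k):
                    ix = x + kx - pad
                    if not 0 <= ix < w:
                        continue  # columns outside the image contribute zero
                    for ch in range(c):
                        acc += image[iy][ix][ch] * kernel_q15[ky][kx][ch]
            out[y][x] = acc
    return out
```

Because the model keeps the full-precision accumulator, it can be compared against hardware before any rounding or saturation policy is chosen.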
Specifications:
- Specify hardware, clock rate, baud rate?
- USB and serial
- R0B0G0 (ex. to send pixels) byte by byte
August 29th, 2019
- Started working on C-based image parser, breaking the image down, streaming image, pseudocode for that
- POST PSEUDOCODE
- How do you drive a UART controller from a C program?
- Linux based; tty logical device
- Break up image, get raw bytes, distribute those
- Using GCC
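One possible way to drive a UART through a Linux tty device is the POSIX termios interface. The sketch below uses Python's standard `termios` module for brevity; the same flags exist in C under `<termios.h>`. The device path, the 9600 baud default, and the helper names are assumptions for illustration only.

```python
import os
import termios


def raw_8n1(attrs, baud=termios.B9600, vmin=1, vtime=0):
    """Return a copy of a tcgetattr() list set up for raw 8N1 transfers."""
    iflag, oflag, cflag, lflag, _ispeed, _ospeed, cc = attrs
    cc = list(cc)
    # No input translation or software flow control: bytes pass through untouched.
    iflag &= ~(termios.IGNBRK | termios.BRKINT | termios.PARMRK | termios.ISTRIP
               | termios.INLCR | termios.IGNCR | termios.ICRNL | termios.IXON)
    oflag &= ~termios.OPOST  # no output post-processing
    lflag &= ~(termios.ECHO | termios.ECHONL | termios.ICANON
               | termios.ISIG | termios.IEXTEN)  # no line editing or signals
    cflag &= ~(termios.CSIZE | termios.PARENB | termios.CSTOPB)  # no parity, 1 stop bit
    cflag |= termios.CS8 | termios.CREAD | termios.CLOCAL  # 8 data bits, receiver on
    cc[termios.VMIN] = vmin    # minimum bytes before read() returns
    cc[termios.VTIME] = vtime  # inter-byte timeout in tenths of a second
    return [iflag, oflag, cflag, lflag, baud, baud, cc]


def open_serial(path, baud=termios.B9600):
    """Open a tty device (path is hypothetical, e.g. /dev/ttyUSB0) in raw 8N1 mode."""
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    termios.tcsetattr(fd, termios.TCSANOW, raw_8n1(termios.tcgetattr(fd), baud))
    return fd
```

After `open_serial`, image bytes can be sent with `os.write(fd, data)` and results read back with `os.read(fd, n)`.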
*Hardware complete by the end of the semester
Suggestions from Dr. Pearlstein:
- Use 16 bit number for quantization and noise
- Forces you to have 4 leading 0’s which you won’t represent… people use numbers greater than 1
- What is the dynamic range? …. you can always scale it finding min and max, preprocess and normalize
- Won’t scale on FPGA
- Avoid saturation to the best of our abilities
- Dynamic allocation
- Chunky vs Planar?
- Chunky
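To make the chunky versus planar distinction concrete: chunky storage interleaves the channels of each pixel, while planar storage keeps one full plane per channel. A small sketch, with hypothetical helper names, that converts between the two:

```python
def chunky_to_planar(data, channels=3):
    """Split interleaved pixel bytes into one bytes object per channel."""
    return [bytes(data[c::channels]) for c in range(channels)]


def planar_to_chunky(planes):
    """Interleave per-channel planes back into a single pixel-ordered stream."""
    out = bytearray()
    for pixel in zip(*planes):
        out.extend(pixel)
    return bytes(out)
```

With chunky order, every byte needed for one pixel arrives back to back, which suits a byte-serial link feeding a window that consumes whole pixels.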
September 2nd, 2019
- Changing image size to 512x512x3
- This is what most conv nets use, not bottlenecked by memory
- Maintain Kernel size of 7×7
September 4th, 2019
- UART block gets us data, with a FIFO
High Level Requirements:
- What does the FPGA do?
- Reads data from the PC
- Stores kernel data
- Stores part of the image
- Compute sum of products (SOP) for image parts and kernel
- Send result to PC
- What does the PC do?
- Load user image
- Send image to FPGA
- Receive end data from FPGA
- Present final output to user
- Test Plan
- Send in a 512x512x3 image (the data), and validate the output
Bottleneck: UART
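A rough worked estimate shows why the UART dominates. It assumes 8N1 framing (10 line bits per byte) and ignores protocol overhead and flow-control stalls; the baud rates passed in are illustrative values.

```python
def uart_transfer_seconds(num_bytes, baud, bits_per_frame=10):
    """Time to move num_bytes over a UART; 8N1 framing costs 10 bits per byte."""
    return num_bytes * bits_per_frame / baud


IMAGE_BYTES = 512 * 512 * 3  # 786,432 bytes for one 512x512x3 image
```

Under these assumptions, `uart_transfer_seconds(IMAGE_BYTES, 9600)` is 819.2 seconds (over 13 minutes) for a single image, and about 68 seconds at 115200 baud.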
September 5th, 2019
Block Diagrams/Architecture
Level 0: High-level
Level 1: Mid-level
PC:
N memories, n-1 memories (n-1 lines)
FPGA:
September 12th, 2019
- Top level diagram
- Drawing that shows connectivity
- Create standard naming convention
- Prototypes for blocks by next week
September 19th, 2019
- Successful implementation of a FIFO for both the sending and receiving of the UART module
- Simulations looked promising; a bitstream was created to test the design in hardware
- Implemented Verilog functionality for the image window and sum-of-products multipliers
- Working log in Google Sheets was created and updated
To be completed for next week:
- Module fully tested in hardware to determine the maximum baud rate for the project
- Begin working on the RTL for the line buffer using BRAM blocks
- Implementing the controller FSM and starting to test each module
- Figure out the fixed point scaling during the kernel/pixel multiplication
- IO domain
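One way to reason about the fixed-point scaling question above: an unsigned 8-bit pixel times a Q0.15 coefficient gives a product with 15 fractional bits, and the sum over a 7x7x3 window needs extra headroom. The sketch below is an assumption-laden model (round half up, saturate to an 8-bit output range), not the hardware's actual policy.

```python
def worst_case_accumulator_bits(k=7, channels=3, pixel_max=255, coeff_max=32768):
    """Signed bit width that can hold the largest possible sum of products."""
    worst = k * k * channels * pixel_max * coeff_max
    return worst.bit_length() + 1  # +1 for the sign bit


def scale_accumulator(acc, frac_bits=15, out_min=0, out_max=255):
    """Drop the fractional bits with round-half-up, then saturate to the output range."""
    rounded = (acc + (1 << (frac_bits - 1))) >> frac_bits  # >> is an arithmetic shift
    return max(out_min, min(out_max, rounded))
```

With these worst-case inputs, `worst_case_accumulator_bits()` evaluates to 32, so a 32-bit signed accumulator would not overflow for a 7x7x3 kernel.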
September 26th, 2019
- Focus on editing the WordPress website, look into add-on features the school could offer, focus less on investing in an io domain
Image Window/Test Benches:
- Write last two test benches; image is unsigned 8-bit, CNN values are signed (except the input, which is unsigned)
- RTR/RTS hardware code complete, tested shift window with high-level test bench, make sure it meets timing, shift window meets timing
UART:
- Stop bit, full bit long, check sender, framing errors, start-stop bit, produce null character
- Figure out how to enable serial flow control, 2 way flow control
- Try to flow control the modem (FPGA)
- Configure the chip
- Look to parse CSV file in Python…write in C for speed purposes
- Front load progress, aim to get OpenCV, run storage blocks
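The start bit, stop bit, and framing error items in the UART notes above can be illustrated with a bit-level model of one 8N1 frame. This is a sketch with hypothetical function names; it models line levels only, not sampling or baud timing.

```python
def frame_byte(value):
    """Line levels for one 8N1 frame: start bit (0), 8 data bits LSB first, stop bit (1)."""
    return [0] + [(value >> i) & 1 for i in range(8)] + [1]


def deframe(bits):
    """Recover the data byte from a 10-bit frame, rejecting bad start or stop bits."""
    if len(bits) != 10 or bits[0] != 0:
        raise ValueError("missing start bit")
    if bits[9] != 1:
        raise ValueError("framing error: stop bit is low")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))
```

A line held low for a whole frame decodes as all-zero data with a failed stop bit, which is one way a receiver can end up reporting a null character alongside a framing error.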
Other:
- Add architecture document to the Gantt chart
- Action items with owner and due date
- Specify who works on the architecture document; get done by this time next week
- 1st draft of test plan on FPGA: Kiera and Hussain
- 1st draft of test plan for UART: Ryan and Paul
October 3rd, 2019
Current Status: Behind
- RTS: request to send, transmit side
- More block diagrams and images
- BRAM block complete, work on testing this
- Potentially create timing diagrams, hand draw cycles
- Not using much memory
- Need to make more progress, particularly on the software side
- Block memory generator
- Working on the fixed point multiplier for the convolution unit
October 17th, 2019
- Figure out the protocol together
- Planned approach:
- BRAM is working
- Ran one simulation, BRAM has been filled, writing and reading data at the same time, read and then write on the same cycle?
- Dual port? Assume reading and writing are not occurring simultaneously
- Is cycle rate faster than pixel rate?
- Limited by the rate of multipliers
- CNN
- Unified state flow
- UART values and determine the protocol
- Discussed the interfacing
- Multiple interfaces need to be specified (but might not be necessary)
- Always ready to receive, read then set rtr low once value is written to the BRAM
- State Image: Image to image start up, build up window before you can output any points
What needs to be done:
- How the start-up works
- Condition the start up behavior
- Pull out garbage until you get the data you want
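The start-up behavior described above (the window must fill before any output point is valid) can be modeled in software. The sketch below assumes raster-order pixels and produces only fully populated windows; it does not model padding, BRAM ports, or handshaking, and all names are hypothetical.

```python
from collections import deque


def first_output_after(width, k):
    """Pixels that must arrive before the first full k x k window exists."""
    return (k - 1) * width + k


def stream_windows(pixels, width, k):
    """Yield (row, col, window) for each full k x k window as pixels stream in."""
    lines = deque(maxlen=k)  # k-1 stored lines plus the line currently arriving
    current = []
    for i, p in enumerate(pixels):
        row, col = divmod(i, width)
        if col == 0:
            current = []
            lines.append(current)  # oldest line drops out once k lines are held
        current.append(p)
        # Nothing is emitted during start-up, while lines or columns are missing.
        if len(lines) == k and col >= k - 1:
            yield row, col, [line[col - k + 1: col + 1] for line in lines]
```

For a 512-pixel-wide image and a 7x7 kernel, `first_output_after(512, 7)` gives 3079 pixels of start-up before the first output.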
Start-up behavior: Hussain
Protocol: Paul
Architecture
- ILB, pixel rate, iron out state flow of SOPU (Hussain will meet with Dr. Pearlstein)
- UART, website (Mr. Lee), fit in work on preprocessors, get the protocols done
- Hardware set up to send, it, find results live and script that in a few different
Serial test bench model
- Write the Verilog code to read the image and kernel, send image and kernel data through Verilog
- Verilog will read CSV file, kernel and image (stored as raw bytes RGB RGB); use the XnView converter to convert the image to a raw file
*Write down the test plan
October 24th, 2019
- Write program to read data from the stream
- fread to read in the entire file into the array
- convert text file
- 512×512
- MATLAB to read in files (imread)
- Take the array that you read and convert to csv file
- flatten the array ; ) — converts any number
- Black white pixels, all 0’s, all 255’s, simple patterns
- Architecture Document has been updated
- Now, assume existence of sign bit and use a 9 bit system (even though it is an 8 bit system) c1.0.7
- Add picture-with blocks and describe it
- The SOPU will be sending data, interaction between the two images
- Determine the max rate we can send (make sure to test this)
- Autobaud can detect
- Make a GIF of the window movement (Kiera)
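The simple test patterns and the raw-to-CSV step noted in this entry could be generated along these lines. The CSV layout (a fixed number of byte values per row) and the helper names are assumptions; the notes do not specify the file format.

```python
def solid_image(width, height, value, channels=3):
    """Raw bytes for an image where every channel of every pixel equals value."""
    return bytes([value]) * (width * height * channels)


def checkerboard(width, height, channels=3):
    """Raw bytes alternating black (0) and white (255) pixels."""
    out = bytearray()
    for y in range(height):
        for x in range(width):
            out.extend([255 if (x + y) % 2 else 0] * channels)
    return bytes(out)


def raw_to_csv_text(raw, per_row):
    """Flatten raw bytes into CSV text with per_row decimal values on each line."""
    rows = (raw[i:i + per_row] for i in range(0, len(raw), per_row))
    return "".join(",".join(str(b) for b in row) + "\n" for row in rows)
```

For example, `raw_to_csv_text(solid_image(512, 512, 0), 3)` would write one all-black pixel per line.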
October 31st, 2019
- Update architecture document
- Implement Ryan’s new design and test with Hussain
- Serial port is working
- Line buffer synchronization
- Must get the interface to work
- (Try to make enhancements)
- Create more efficient test benches
- Issue pass/fail
- Create random data to test
- Self checking test
“Automated environment”
Communication to and from the software
November 7th, 2019
- Continue working on graphics for the website
- Issues
- No physical hardware issue, as of right now
- Not working on the terminal
- 9600 baud
- VM of Windows
- Ryan made a Java program for random values
- Test bench to implement
- Read in values and compare against output values
- Hussain: wrote all of the RTL
- Interfacing between the UART and ILB (image line buffer), how data lines up, data distributed approach
- Pulls the UART, ready to read, latches the byte, master reads the new byte, fills the kernel, when
- Gets a pixel from the UART, send
- Interface: ILB, immediately take in the data as it comes
- One to read and one to write to
- Run test synthesis
November 14th, 2019
Self check is finished, random number generator will be used to test
- Error with ILB fixed, memory spaces were too big
- SOPU is in the testing phase
- Integration and unit testing
Work on the documentation
What speed to use for UART?
- 9600, slow, stalls for new data but at least it works
- Port to Windows, having Linux issues
- Linux on FPGA, FTDI chip
- Get numbers on FPGA utilization, figure out the resources required
November 21st, 2019
- Animations agreed upon, set to finish before senior project presentations
- Paul is trying to get serial communication working, still Linux based
- Ran program he wrote, would return the character T, or it would time out and not return anything, seemed random
- Possible explanation: non-blocking will just come back, should not time out, can return null
- Need a way to block certain data
- Check the results, do non-blocking reads and see how much valid data is present, reading raw bytes from a byte stream
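One way to separate "no data yet" from "data arrived" when reading raw bytes is to wait on the descriptor with a deadline and keep reading until the expected count is reached. This sketch uses `select` and `os.read` from the Python standard library; the function name and timeout policy are assumptions, not the program described in the notes.

```python
import os
import select
import time


def read_exact(fd, n, timeout_s):
    """Read up to n bytes from fd, returning early (possibly short) on timeout or EOF."""
    buf = bytearray()
    deadline = time.monotonic() + timeout_s
    while len(buf) < n:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # overall deadline passed
        ready, _, _ = select.select([fd], [], [], remaining)
        if not ready:
            break  # nothing arrived before the deadline
        chunk = os.read(fd, n - len(buf))
        if not chunk:
            break  # end of stream
        buf.extend(chunk)
    return bytes(buf)
```

The caller can compare `len(result)` with `n` to see how much valid data was present instead of treating a short read as an error.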
- Submit an abstract
January 27th, 2020
- Develop a software model for the FPGA
- UART protocol complete, can communicate with the PC
- C++ has better file I/O than Verilog so this can be exploited
- When you do arithmetic operations in Verilog the result will be limited in size
- Many advantages to coding in C
- Write a Tcl script for simulations for wave debugging
- Log all signals from the very beginning
- Hopefully verify and test, combine ILB and SOPU together to test
- Post synthesis, post design
- Using different clocks can result in timing violations, if you simulate the chip well enough you can find errors
- Hassle to figure out what is wrong with the FPGA
- Target for next week: C++ software model done, start hardware/software post-simulation and integrate those, hardware test the ILB, create new RTL for a direct connection that can read and write the values at the same time, test the hardware
- Decoupled verification, make a plan
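The note in this entry about Verilog arithmetic results being limited in size matters for a bit-accurate software model: Python integers never overflow, so the model has to truncate explicitly to the width of the hardware signal. A small sketch, with hypothetical helper names:

```python
def wrap_unsigned(value, bits):
    """Keep only the low `bits` bits, as an unsigned hardware register would."""
    return value & ((1 << bits) - 1)


def wrap_signed(value, bits):
    """Keep the low `bits` bits and interpret them as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value
```

For example, `wrap_unsigned(200 + 100, 8)` gives 44 because the carry out of bit 7 is lost, while widening the result to 9 bits keeps the full value of 300.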