## Design and Implementation of Pentium-M Based Floswitch for Intracluster Communication

Veerappa Chikkagoudar, Dr. U. N. Sinha, Prof. B. L. Desai

veereshcg@yahoo.co.in, uns@flosolver.nal.res.in, bldesai@bvb.edu

Department of Electronics and Communication, B. V. Bhoomaraddi College of Engg. and Tech., Hubli-580031

### Abstract:

Flosolver MK6 is a parallel processing system based on the distributed memory concept and built around Pentium-III processors, which act as processing elements (PEs). Communication between processing elements is critical and is carried out through a hardware switch called the Floswitch. The Floswitch supports both message passing and message processing; message processing is a unique feature of the Floswitch.

In the existing MK-6 system, communication between PEs is done through the Intel 486-based Floswitch, which operates at 32 MHz and has a 32-bit wide data path. The data transfer rate and floating-point computation rate of the existing switch need to be increased.

The objective of the present project is to design a Floswitch board based on the Intel Pentium-M processor and its chipset, which provides a 64-bit data path for the flow and processing of data at increased speed.

The Pentium-M processor has a core frequency of 1.6 GHz and a 400-MHz source-synchronous processor system bus. It transfers data four times per bus clock, which improves performance. The front side bus (FSB) connects the processor to the Intel E7501 chipset Memory Controller Hub (MCH).

Assembly code is written for processor initialization and MCH configuration. Decoding logic for dual-port memory (DPM) read and write operations is written in Verilog and implemented on an FPGA.

### 1. Introduction:

The Flosolver project was started in 1986 at the National Aerospace Laboratories (NAL). The mandate for the project was to design, develop, fabricate and use a suitable parallel processing computer for application to fluid dynamical and aerodynamical problems [1]. Since 1986, six generations of the Flosolver machine have evolved, namely Flosolver MK-1, MK-2, MK-3, MK-4, MK-5 and MK-6.

Flosolver MK-6 is the latest of these parallel computers. It is based on 128 Pentium III processors (which act as processing elements, PEs) on 64 dual-processor boards, each with 1 GB RAM and an 80 GB HDD. It is essentially a distributed memory system. A group of four dual-processor boards with a Floswitch and an optical module forms a natural cluster, and 16 such clusters form the system. Processing elements communicate through the Floswitch (a communication switch) using a PCI-DPM interface card. Clusters communicate through the optical module.

#### 1.1 Statement of Problem:

In the existing FLOSOLVER MK-6 system, the Floswitch is based on the Intel 486 processor, which operates at 32 MHz and has a 32-bit data path. To achieve a higher data transfer rate and faster processing of data, the existing Floswitch has to be enhanced using a Pentium-M processor running at 1.6 GHz.

#### 1.2 Design objective:

Achieving a higher data transfer rate and faster processing of data translates into the following design requirements:

- The design should increase speed and performance.
- The data path width has to be increased.
- Future enhancements should be kept in mind, and the design should provide the flexibility needed for coding the protocol.

The project developments described in the following sections are undertaken to satisfy these requirements.

#### 1.3 Methodology:

Design and implementation of the Floswitch using the Pentium-M processor and its chipset involves the following steps:

1. Analyzing the existing system.
2. Choosing the essential components according to the requirements.
3. Designing the schematics using ORCAD 9.1 (board design tool).
4. Analyzing the timing requirements of all the components.
5. Assembly-level programming for processor initialization and configuration of the Memory Controller Hub (MCH).
6. Developing the Verilog code for interfacing DDR memory and DPM.

### 2. Analysis of the Existing Floswitch:

In the Flosolver MK-6 parallel computer, the existing Floswitch is based on the i486 processor; the switch also includes a CPLD, buffers, latches, an EPROM and other components. The block diagram of the existing Floswitch is shown below.

The **Floswitch** provides an intelligent programmable interface and serves as a parallel bus interface, participating in all bus operations.

When the system executes a parallel algorithm, the task is broken into smaller sub-tasks that are fed to all the processing elements. Each processing element is autonomous and does not interfere in the others' operation, except for message synchronization.

The processing element puts the data into the Dual Port Memory (DPM), which is connected to the Floswitch on the other side. The PE specifies the kind of operation to be performed on the data in a particular block called the command block. The Floswitch recognizes this while polling the command block. Once the command is set, the switch services the request by performing the required operation and provides the result to the destination PE's DPM. From the DPM, the PE can get the required data. The interfacing between the PE and the Floswitch is done through the IPC card. This is the place where interaction between PEs takes place.

In the existing Flosolver MK-6 system, the communication between processing elements is through a Floswitch based on the Intel 486 processor, which operates at 32 MHz and has a 32-bit wide data path. With this speed and data path width, the data transfer rate is 128 MBPS. The data transfer rate and floating-point computation rate of the existing switch need to be increased for better performance of Flosolver MK-6 [2].
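The 128 MBPS figure can be reproduced from the clock rate and the bus width. The derivation below is only a sketch and assumes one full 32-bit transfer per clock cycle, which the text does not state explicitly:

$$
R_{\text{i486}} = 32 \times 10^{6}\ \frac{\text{transfers}}{\text{s}} \times \frac{32\ \text{bits}}{8\ \text{bits/byte}} = 128 \times 10^{6}\ \frac{\text{bytes}}{\text{s}} = 128\ \text{MBPS}
$$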

To achieve a higher data transfer rate and floating-point computation rate in the proposed communication switch, the **Intel Pentium-M processor** and its **chipset** are used.



#### 2.1 Considerations for selecting the Pentium-M processor:

The Intel Pentium M processor, based on 90 nm process technology and featuring a 2-MB L2 cache and a 400-MHz front side bus (FSB), is the next-generation high-performance, low-power mobile processor based on the Intel Pentium processor architecture.

The processor maintains support for MMX technology and Streaming SIMD Extensions (SSE) instructions, and full compatibility with IA-32 software.

In the proposed communication switch, the **Pentium-M processor** has a core frequency of 1.6 GHz and a 400-MHz source-synchronous processor system bus. It transfers data four times per bus clock, which improves performance.

In the Pentium-M processor, the A(31:3)# address signals define a $2^{32}$-byte physical memory address space; these pins transmit the address of a transaction. The D(63:0)# data signals provide a 64-bit data path between processor system bus agents. In this project the Intel E7501 chipset Memory Controller Hub (MCH) is used with the Pentium-M processor. The MCH is the component that contains the system memory interface and the processor interface. It communicates with the I/O Controller Hub (ICH3-S) via the Hub Interface (HI). With a 100 MHz bus clock, the address signals are double pumped (2x) to run at 200 MHz and the data signals are quad pumped (4x) to run at 400 MHz.

The system bus that connects the processor to the E7501 MCH runs at an effective 400 MHz. With this system bus speed and the Pentium-M processor's 64-bit wide data path, a maximum data transfer rate of 3.2 GBPS can be achieved.
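As a quick check of the 3.2 GBPS figure, the peak rate follows from the 100 MHz bus clock, the quad-pumped data signals and the 64-bit data path. This is a peak-rate sketch only; sustained throughput will be lower:

$$
R_{\text{FSB}} = 100 \times 10^{6}\ \frac{\text{clocks}}{\text{s}} \times 4\ \frac{\text{transfers}}{\text{clock}} \times \frac{64\ \text{bits}}{8\ \text{bits/byte}} = 3.2 \times 10^{9}\ \frac{\text{bytes}}{\text{s}} = 3.2\ \text{GBPS}
$$

Set against the 128 MBPS of the 32 MHz, 32-bit i486-based switch, this is a factor of $3200 / 128 = 25$ in peak bus bandwidth.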

#### **2.2.1 Advantages of the Pentium-M Processor:**

The following are the essential points on which the Pentium-M processor scores over the i486 processor in the Floswitch design.

• The 64-bit wide memory is important for double-precision floating-point data. Because of the change to a 64-bit wide data bus, the Pentium-M processor is able to retrieve floating-point data in one read cycle, instead of two as in the i486. This gives the Pentium-M processor a higher throughput than the i486.

• In an i486 with its unified cache, a data-intensive program quickly filled the cache, leaving little room for instructions. This slowed the execution speed of the i486. In the Pentium-M processor this cannot occur because of the separate instruction and data caches. The processor has a 32 KB instruction cache and a 32 KB write-back data cache [4].

• A design based on the Pentium-M processor uses a chipset consisting of the MCH, ICH3-S and FWH, which enables better management of memory, I/O and other peripherals.

### 3. Design and Implementation:

The design and implementation of the Floswitch based on the Pentium-M processor is divided into four phases, which are discussed below.

**Phase I:** In the first phase, the OrCAD design is made: components are selected, libraries are created for the selected components, and the components are then integrated.

**Phase II:** In the second phase, assembly-level programming for processor initialization and configuration of the Memory Controller Hub (MCH) is done.

**Phase III:** In the third phase, the Verilog code is developed to interface DDR SDRAM with DPM, for DPM read and write operations.

**Phase IV:** Simulation and testing are carried out to ensure the proper working of the Floswitch.

#### 3.1 Block Diagram of Pentium-M Floswitch:



#### **3.1.1 On Block Diagram:**

This block diagram describes the design details of the Pentium-M processor based Floswitch. A Front Side Bus (FSB) of 400 MHz connects the Pentium-M processor to the MCH.

The E7501 chipset consists of three major components: (1) Memory Controller Hub (MCH), (2) I/O Controller Hub (ICH), and (3) PCI/PCI-X 64-bit Hub.

The MCH is the central hub for all data passing between the processor (via the system bus), memory (via the memory interface) and I/O (via the I/O interface).

The DDR memory interface signals of the MCH are connected to the FPGA. Glue logic is written in Verilog on the FPGA to interface the synchronous DDR memory interface with the asynchronous Dual Port SRAM (DPM). Thus the DPMs are controlled by the FPGA. The Floswitch has four independent DPMs, each with a 64-bit data length, and one DPM with a 32-bit data length. In intracluster communication, the Floswitch accesses one port of a DPM and the processing elements access the other port through a 64-bit PCI card.

The existing optical module has a 32-bit data length. This module is interfaced with the single 32-bit DPM to provide inter-cluster communication.

The Floswitch needs permanent memory for the storage of the operating system and boot-up program. Hence FWH flash memory is used. The functionality of the Floswitch is written in the FWH flash memory. The flash memory is interfaced to the Pentium-M processor through the ICH3-S and MCH using the LPC (Low Pin Count) interface.

#### 3.2 On Building Blocks:

#### **Intel Pentium-M processor:**

The Intel Pentium M processor, based on 90 nm process technology and featuring a 2-MB L2 cache and a 400-MHz front side bus (FSB), is the next-generation high-performance, low-power mobile processor based on the Intel Pentium processor architecture.

#### **Intel E7501 chipset Memory Controller Hub (MCH):**

The E7501 chipset consists of three major components: the Intel E7501 Chipset Memory Controller Hub (MCH), the Intel I/O Controller Hub 3-S (ICH3-S), and the PCI/PCI-X 64-bit Hub 2.0 (P64H2).

The MCH provides the system bus interface, the memory controller, a hub interface for legacy I/O, and a high-performance hub interface for PCI/PCI-X bus expansion.

#### **Intel 82801CA I/O Controller Hub 3-S (ICH3-S):**

The I/O Controller Hub is the component that contains the primary PCI interface and the LPC interface. It connects to the MCH through the 8-bit Hub Interface 1.5.

The ICH3-S provides extensive I/O support. Functions and capabilities include:

• PCI Local Bus Specification, Revision 2.2-compliant, with support for 33 MHz PCI operations.

• Low Pin Count (LPC) interface

• Firmware Hub (FWH) interface support

#### **Firmware Hub Flash Memory:**

The FWH is a 4-Mbit (512K x 8) non-volatile memory that can be read, erased and reprogrammed. These operations can be performed using a single low-voltage (3.0 to 3.6 V) supply. The memory is divided into 8 blocks of 64 KB each.

#### **Virtex-II platform FPGA:**

The Virtex-II family is a platform FPGA developed for high performance from low-density to high-density designs that are based on IP cores and customized modules. The family delivers complete solutions for telecommunication, wireless, networking, video, and DSP applications, including PCI, LVDS, and DDR interfaces.

Some of the main features of the Virtex-II FPGA are:

- 3 million system gates.
- 420 MHz internal clock speed.
- Maximum of 720 I/O pads.
- 1.7 Mb of dual-port RAM in 18 Kbit block SelectRAM.
- High-performance interfaces to external memory such as DRAM, SDR/DDR SDRAM and SRAM.

#### **Dual Port Memory (DPM):**

The 70V659/58/57 device provides two independent ports with separate control, address, and I/O pins that permit independent, asynchronous access for reads or writes to any location in memory. It can support an operating voltage of either 3.3 V or 2.5 V on one or both ports, controlled by the OPT pins. The power supply for the core of the device (VDD) remains at 3.3 V.

### 4. Assembly Code for Processor Initialization and MCH Configuration:

Assembly code is required for processor initialization and for configuration of the chipset host controller registers for proper MCH (Memory Controller Hub) operation. This assembly code is written using the MASM 6.11 assembler, and its description is given below.

The microprocessor begins operation in real mode by default whenever power is applied or the microprocessor is reset.

#### 4.1 Processor initialization:

The address generated by the processor after power-on is **FFFFFFF0h** [9]. This processor address corresponds to address **7FFF0h** in the FWH flash memory. At this address, the far jump instruction **ea 20 00 00 F0** (offset 0020h, segment F000h) gives 70020h as the starting location for the code. Since the stack is defined from 0000h to 001Fh, the code starts from location 70020h. Here the cache is disabled by setting the field CD = 1 in CR0 (control register 0). A loop with a counter of FFFFh is then executed, after which stack initialization is done.
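The relation between the processor addresses and the flash addresses quoted above can be checked with a short calculation. It is only a sketch and rests on one assumption not stated in the text: that the 512 KB ($2^{19}$-byte) FWH decodes only the low 19 address bits, so processor addresses alias into the flash modulo $2^{19}$. The far jump bytes are read in the usual x86 little-endian order (offset first, then segment), and the real-mode target is segment times 16 plus offset.

$$
\begin{aligned}
\text{FFFFFFF0h} \bmod 2^{19} &= \text{7FFF0h} \\
\text{F000h} \times \text{10h} + \text{0020h} &= \text{F0020h} \\
\text{F0020h} \bmod 2^{19} &= \text{70020h}
\end{aligned}
$$

Both flash addresses given in the text, 7FFF0h for the reset vector and 70020h for the start of the code, are consistent with this mapping.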

Before using protected mode, the address of the global descriptor table and its limit are loaded into the GDT register [4]. The processor then moves to protected mode by setting bit 0 (PE) in CR0.

#### 4.2 Configuration of MCH:

After moving into protected mode, the chipset host controller registers are configured for MCH operation. The MCH contains two sets of software-accessible registers: **control registers**, which provide access to PCI configuration space, and **internal configuration registers**, which reside within the MCH and control the DRAM configuration and other chipset operating parameters.

The MCH contains two registers that reside in the processor I/O address space. They are the Configuration Address (CONFIG\_ADDRESS) register and the Configuration Data (CONFIG\_DATA) register. CONFIG\_ADDRESS is a 32-bit register that can be accessed as a Dword. This register contains the bus number, device number, function number and register number. CONFIG\_DATA is a 32-bit read/write window into configuration space; the portion of configuration space that it references is determined by the contents of CONFIG\_ADDRESS.
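The packing of these four fields into CONFIG\_ADDRESS can be written as a formula. The layout below is the standard PCI configuration mechanism, which is assumed here to apply unchanged to the MCH: bit 31 is the enable bit, bits 23:16 hold the bus number, bits 15:11 the device number, bits 10:8 the function number, and bits 7:2 the Dword-aligned register number.

$$
\text{CONFIG\_ADDRESS} = 2^{31} + \text{bus} \cdot 2^{16} + \text{device} \cdot 2^{11} + \text{function} \cdot 2^{8} + (\text{register} \,\&\, \text{FCh})
$$

For example, with bus 0, device 0, function 0 and a register offset of 60h (an offset chosen purely for illustration), the value written to CONFIG\_ADDRESS is 80000060h; the register contents are then read or written through CONFIG\_DATA.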

The host controller registers configured are:

**DRB [0:7] (DRAM Row Boundary Register):** The DRAM Row Boundary Register defines the upper boundary address of each DRAM row with a granularity of 32 MB in single-channel mode. The MCH supports 128-Mb, 256-Mb and 512-Mb DRAM densities; this register is configured for 512-Mb DRAM.

**DRA [3:0] (DRAM Row Attribute Register):** The DRAM Row Attribute Register defines the page sizes to be used for each row of memory. This register is configured for a 16 KB (single-channel) page size.

**DRT (DRAM Timing Register):** This register controls the timing of the DRAM controller.

**DRC (DRAM Controller Mode Register):** This register controls the mode of the DRAM controller.

**TOLM (Top of Low Memory Register):** This register contains the maximum address below 4 GB that should be treated as main memory. The memory address found in DRB7 reflects the top of total memory. This register contains the address that corresponds to bits 31:27, with a 128 MB boundary. So in single-channel mode, 8GB-32 of host-address memory space is available.

The Global Descriptor Table (GDT) [4] is used to define the base address, limit (size of memory), and access rights of the particular memory devices.

### 5. Verilog Implementation on FPGA

The MCH used in the design provides two DDR interface channels. Only channel A is used in the design; channel B is disabled. The MCH does not interface with the DPM directly, so an interface between the SDRAM interface and the DPM is needed, implemented in a gate array.

This is accomplished in the present design by routing the DDR memory interface signals of the MCH and the DPM interface signals to the FPGA. The two sets of interface signals are coupled by a Verilog module that translates information back and forth between the interfaces.

#### 5.1 Address translation and Decoding

The MCH contains address decoders that translate the address received on the host bus or the hub interface. Decoding and translation of these addresses vary across the three supported SDRAM device densities. The number of pages, page sizes, and densities supported also vary with the device. The MCH supports 128-Mb, 256-Mb, and 512-Mb SDRAM devices. The multiplexed row/column address to the SDRAM memory array is provided by the memory bank select and memory address signals. These addresses are derived from the host address bus as defined in the table below for the 512-Mb SDRAM device.

| Tech (Mbit) | Row size / Page size | Row / Col | BA1 | BA0 | A12 | A11 | A10 | A9 | A8 | A7 | A6 | A5 | A4 | A3 | A2 | A1 | A0 |
|-------------|----------------------|-----------|-----|-----|-----|-----|-----|----|----|----|----|----|----|----|----|----|----|
| 512         | 512 MB               | Row       | 14  | 15  | 28  | 26  | 25  | 24 | 27 | 16 | 23 | 22 | 21 | 20 | 19 | 18 | 17 |
|             | 16 KB                | Col       |     |     |     | 13  | AP  | 12 | 11 | 10 | 9  | 8  | 7  | 6  | 5  | 0  | 0  |

Table: Address translation and decoding

#### **5.2.1 Address for DPMs**

The Floswitch has four independent memory blocks, each with a 64-bit data length (each block is a set of two DPMs), and one DPM with a 32-bit data length for the optical switch. In total, the Floswitch has five DPMs [6].

During the active command, the three higher-order address bits are latched and used for selecting one of the five DPMs, and six of the row address bits are used as the higher-order address bits for the DPMs.

An active Row Address Strobe (RAS# = 0) indicates a valid row address.

#### Active command is:

((ddr\_cs\_n\_i[0] == 1'b0) && (ddr\_ras\_n\_i == 1'b0) && (ddr\_cas\_n\_i == 1'b1) && (ddr\_we\_n\_i == 1'b1))

adr[5:0] <= {ddr\_ma\_i[4:0], ddr\_ma\_i[7]};

That is, processor address A(21:16).

When this command occurs, one of the five DPMs is selected based on the 3 MSBs of the row address:

dpm\_sel[2:0] <= {ddr\_ma\_i[9], ddr\_ma\_i[6:5]};

That is, processor address A(24:22).
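The row-address latching described above can be sketched as a small clocked Verilog module. This is an illustration under stated assumptions, not the project's actual source: the module name `dpm_row_latch`, the clock `clk_i` and the reset `rst_n_i` are hypothetical, while the remaining signal names and bit selections are taken from the text.

```verilog
// Sketch only: module name, clk_i and rst_n_i are assumptions;
// the other signal names and bit selections follow the text.
module dpm_row_latch (
    input  wire        clk_i,        // assumed sampling clock
    input  wire        rst_n_i,      // assumed active-low reset
    input  wire [0:0]  ddr_cs_n_i,   // chip select, active low
    input  wire        ddr_ras_n_i,  // row address strobe, active low
    input  wire        ddr_cas_n_i,  // column address strobe, active low
    input  wire        ddr_we_n_i,   // write enable, active low
    input  wire [12:0] ddr_ma_i,     // multiplexed memory address A12..A0
    output reg  [5:0]  adr,          // processor address A(21:16)
    output reg  [2:0]  dpm_sel       // processor address A(24:22)
);
    // ACTIVE command: CS# = 0, RAS# = 0, CAS# = 1, WE# = 1
    wire active_cmd = (ddr_cs_n_i[0] == 1'b0) && (ddr_ras_n_i == 1'b0) &&
                      (ddr_cas_n_i   == 1'b1) && (ddr_we_n_i  == 1'b1);

    always @(posedge clk_i or negedge rst_n_i) begin
        if (!rst_n_i) begin
            adr     <= 6'd0;
            dpm_sel <= 3'd0;
        end else if (active_cmd) begin
            // Row bits MA4..MA0 carry A21..A17 and MA7 carries A16
            adr     <= {ddr_ma_i[4:0], ddr_ma_i[7]};
            // Row bits MA9, MA6, MA5 carry A24, A23, A22
            dpm_sel <= {ddr_ma_i[9], ddr_ma_i[6:5]};
        end
    end
endmodule
```

Both outputs hold their values between ACTIVE commands, so a later read or write command can still use them.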

An active Column Address Strobe (CAS# = 0) indicates a valid column address and initiates a read or write transaction.

#### Read command is:

((ddr\_cs\_n\_i[0] == 1'b0) && (ddr\_cas\_n\_i == 1'b0) && (ddr\_ras\_n\_i == 1'b1) && (ddr\_we\_n\_i == 1'b1))

Address for DPMs is:

dpm\_adr[15:0] <= {adr[5:0], ddr\_ba\_i[0], ddr\_ba\_i[1], ddr\_ma\_i[11], ddr\_ma\_i[9:3]};

That is, processor address A(21:6).

#### Write command is:

((ddr\_cs\_n\_i[0] == 1'b0) && (ddr\_cas\_n\_i == 1'b0) && (ddr\_ras\_n\_i == 1'b1) && (ddr\_we\_n\_i == 1'b0))

Address for DPMs is:

dpm\_adr[15:0] <= {adr[5:0], ddr\_ba\_i[0], ddr\_ba\_i[1], ddr\_ma\_i[11], ddr\_ma\_i[9:3]};

That is, processor address A(21:6).
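The column-phase decoding can be sketched in the same way. The module below is a hedged illustration: the module name `dpm_col_addr`, the clock `clk_i` and the strobe outputs `rd_cmd` and `wr_cmd` are hypothetical names, and the read condition uses the standard DDR command encoding (CS# = 0, RAS# = 1, CAS# = 0, WE# = 1). The address concatenation is the one given in the text.

```verilog
// Sketch only: module name, clk_i, rd_cmd and wr_cmd are assumptions.
module dpm_col_addr (
    input  wire        clk_i,        // assumed sampling clock
    input  wire [0:0]  ddr_cs_n_i,   // chip select, active low
    input  wire        ddr_ras_n_i,  // row address strobe, active low
    input  wire        ddr_cas_n_i,  // column address strobe, active low
    input  wire        ddr_we_n_i,   // write enable, active low
    input  wire [1:0]  ddr_ba_i,     // bank address BA1..BA0
    input  wire [12:0] ddr_ma_i,     // multiplexed memory address A12..A0
    input  wire [5:0]  adr,          // A(21:16), latched at the ACTIVE command
    output wire        rd_cmd,       // high while a READ command is presented
    output wire        wr_cmd,       // high while a WRITE command is presented
    output reg  [15:0] dpm_adr       // processor address A(21:6)
);
    // Column command: CS# = 0, RAS# = 1, CAS# = 0; WE# selects read or write
    wire col_cmd = (ddr_cs_n_i[0] == 1'b0) && (ddr_ras_n_i == 1'b1) &&
                   (ddr_cas_n_i   == 1'b0);

    assign rd_cmd = col_cmd && (ddr_we_n_i == 1'b1);
    assign wr_cmd = col_cmd && (ddr_we_n_i == 1'b0);

    always @(posedge clk_i) begin
        if (col_cmd) begin
            // {A21..A16, A15 (BA0), A14 (BA1), A13 (MA11), A12..A6 (MA9..MA3)}
            dpm_adr <= {adr[5:0], ddr_ba_i[0], ddr_ba_i[1],
                        ddr_ma_i[11], ddr_ma_i[9:3]};
        end
    end
endmodule
```

Column bit MA10 is skipped because it carries the auto-precharge (AP) flag rather than an address bit, as the column row of the translation table shows.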

### 6. Results

This section describes the DPM read and write operations. Data transfer between the Floswitch and the processing elements takes place through dual-port memory (DPM). In this design the Floswitch has five DPMs: four are connected to the four nodes of the cluster, and one serves the optical switch.

DPM control signals such as **oe** (output enable), **ce** (chip enable) and **rw** (read/write) are used to control the DPM read and DPM write operations.

#### 6.1 DPM Read Operation

When oe = 0, ce = 0 and rw = 1, the 32-bit DPM data is read by the FPGA.

Data Transfer from DPM to FPGA

When dpm\_sel = 3'b001; dout [31:0] <= dpm1\_data\_io [31:0];

When dpm\_sel = 3'b010; dout [31:0] <= dpm2\_data\_io [31:0];

When dpm\_sel = 3'b011; dout [31:0] <= dpm3\_data\_io [31:0];

When dpm\_sel = 3'b100; dout [31:0] <= dpm4\_data\_io [31:0];

When dpm\_sel = 3'b101; dout [31:0] <= dpm5\_data\_io [31:0];
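The five read cases above amount to a registered multiplexer selected by dpm\_sel. The following is a minimal sketch, assuming a hypothetical module name `dpm_read_mux` and clock `clk_i`; for simplicity the bidirectional DPM data buses are modeled here as plain inputs.

```verilog
// Sketch only: module name and clk_i are assumptions. The DPM data buses are
// bidirectional in the real design; this read-path model treats them as inputs.
module dpm_read_mux (
    input  wire        clk_i,        // assumed sampling clock
    input  wire        oe,           // output enable, active low
    input  wire        ce,           // chip enable, active low
    input  wire        rw,           // 1 = read, 0 = write
    input  wire [2:0]  dpm_sel,      // DPM select latched from A(24:22)
    input  wire [31:0] dpm1_data_io,
    input  wire [31:0] dpm2_data_io,
    input  wire [31:0] dpm3_data_io,
    input  wire [31:0] dpm4_data_io,
    input  wire [31:0] dpm5_data_io,
    output reg  [31:0] dout          // data captured by the FPGA
);
    always @(posedge clk_i) begin
        if (!oe && !ce && rw) begin
            case (dpm_sel)
                3'b001:  dout <= dpm1_data_io;
                3'b010:  dout <= dpm2_data_io;
                3'b011:  dout <= dpm3_data_io;
                3'b100:  dout <= dpm4_data_io;
                3'b101:  dout <= dpm5_data_io;
                default: dout <= dout;  // no DPM selected: hold the last value
            endcase
        end
    end
endmodule
```

Holding dout for unused select codes is a choice made for this sketch; the text does not say how the codes 3'b000, 3'b110 and 3'b111 are handled.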

#### 6.2 DPM Write Operation

When oe = 1, ce = 0 and rw = 0, 32-bit data is written to the DPM memory location.

Data Transfer from FPGA to DPM

When dpm\_sel = 3'b001; dpm1\_data\_io[31:0] <= din[31:0];

When dpm\_sel = 3'b010; dpm2\_data\_io[31:0] <= din[31:0];

When dpm\_sel = 3'b011; dpm3\_data\_io[31:0] <= din[31:0];

When dpm\_sel = 3'b100; dpm4\_data\_io[31:0] <= din[31:0];

When dpm\_sel = 3'b101; dpm5\_data\_io[31:0] <= din[31:0];
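On a bidirectional data bus, the write path is commonly built from tri-state drivers, so that only the selected DPM bus is driven and the others stay released. The sketch below takes that approach with continuous assignments, which differs in form from the registered assignments listed above; the module name `dpm_write_drv` is hypothetical, and the control polarities are the ones given in the text.

```verilog
// Sketch only: module name is an assumption. Continuous tri-state assignments
// are used here in place of the registered assignments listed in the text.
module dpm_write_drv (
    input  wire        oe,           // output enable, 1 during a write
    input  wire        ce,           // chip enable, active low
    input  wire        rw,           // 1 = read, 0 = write
    input  wire [2:0]  dpm_sel,      // DPM select latched from A(24:22)
    input  wire [31:0] din,          // data from the FPGA side
    inout  wire [31:0] dpm1_data_io,
    inout  wire [31:0] dpm2_data_io,
    inout  wire [31:0] dpm3_data_io,
    inout  wire [31:0] dpm4_data_io,
    inout  wire [31:0] dpm5_data_io
);
    // Write condition from the text: oe = 1, ce = 0, rw = 0
    wire wr = oe && !ce && !rw;

    // Drive only the selected bus; leave every other bus at high impedance
    assign dpm1_data_io = (wr && dpm_sel == 3'b001) ? din : 32'bz;
    assign dpm2_data_io = (wr && dpm_sel == 3'b010) ? din : 32'bz;
    assign dpm3_data_io = (wr && dpm_sel == 3'b011) ? din : 32'bz;
    assign dpm4_data_io = (wr && dpm_sel == 3'b100) ? din : 32'bz;
    assign dpm5_data_io = (wr && dpm_sel == 3'b101) ? din : 32'bz;
endmodule
```

Releasing the unselected buses lets the DPMs drive them during reads without contention.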

### 7. Conclusion

This project aims to extend the capability of the existing Floswitch from a 32 MHz, 32-bit data path to a 400 MHz, 64-bit data path, which is a substantial enhancement. With this in view, the Pentium-M processor was selected; its front side bus (FSB) runs on a 100 MHz bus clock, with the data signals quad pumped to 400 MHz. The E7501 MCH chipset is used to interface with this FSB. The MCH has to be programmed to get the desired effect, but it has many undocumented registers. Many of these have been programmed by trial and error, and the current status is as follows:

For every data read from the DPM, two read cycles are needed. The write cycle, however, is normal and needs only one cycle.

Summarizing, the main activities done during the project work are:

In the first phase, the OrCAD design is carried out using the **ORCAD Capture tool.** Components are selected, libraries are created for the selected components, and the components are then integrated.

In the second phase, assembly programming for processor initialization and configuration of the Memory Controller Hub (MCH) is done using the **MASM 6.11** assembler.

In the third phase, decoding logic for the synchronous DDR SDRAM and asynchronous DPM interface is written in active HDL (Verilog) on the FPGA using the **Xilinx ISE 7.1i tool**, for DPM read and write operations.

The further scope of study is to implement the cache and data transfer in burst mode.

### 8. Bibliography

- [1] U. N. Sinha, M. D. Deshpande and V. R. Sarasamma, "Flosolver, Parallel Computer for Fluid Dynamics", Current Science, Volume **57**, pages 1277-1285, 1988.
- [2] U. N. Sinha, M. D. Deshpande and V. R. Sarasamma, "Flosolver, A Parallel Computer", Super Computer, Volume 4, pages 37-42, 1989.
- [3] U. N. Sinha, M. D. Deshpande and V. R. Sarasamma, "Flosolver, a Parallel Computer", Super Computer, Volume 4, pages 37-42, July 1989.
- [4] Barry B. Brey (2000), "The Intel Microprocessors: Architecture, Programming and Interfacing", Prentice Hall of India, Fourth Edition.
- [5] Douglas V. Hall (1991), "Microprocessors and Interfacing", Tata McGraw Hill.
- [6] DFS0509, Anandaraj D, Jagannadham V V, Venkatesh K. S, Sunilkumar, Rajalakshmy Sivaramakrishnan, "Hardware Design Document for Pentium Based Floswitch".
- [8] Samir Palnitkar, "Verilog HDL".
- [9] IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide.
- [10] IA-32 Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference.