Original filename: scipaper.pdf
This PDF 1.2 document has been generated by TeX output 2005.06.09:1357 / dvipdfm 0.13.2c, Copyright © 1998, by Mark A. Wicks, and has been sent on pdf-archive.com on 28/02/2017 at 20:41, from IP address 73.239.x.x.
The current document download page has been viewed 148 times.
File size: 211 KB (7 pages).
Privacy: public file
Download original PDF file
Hardware Acceleration of Database Operations Using
Nagender Bandi Sam Schneider Divyakant Agrawal Amr El Abbadi
Department of Computer Science
University of California, Santa Barbara,
schneid, agrawal, email@example.com
Research efforts in conventional CPU architectures over the
past decade have focused primarily on performance enhancement. In contrast, the NPU (Network Processing Unit) architectures have evolved significantly. The memory hierarchy of a typical network router features a Content-Addressable
Memory (CAM) interfacing with the NPU in the same way
as a DRAM interfaces with a traditional CPU. CAM technology provides very fast constant-time lookups over large
amounts of data and facilitates a wide range of novel highspeed networking solutions ranging from Packet Classification to Intrusion Detection. While these networking applications span an entirely different domain than the database
applications, they share a common operation namely lookup. Whether it is a database selectivity or join query, the
core of many important database operators involves lookingup for a particular data entry among huge amounts of data.
In this paper, we investigate how CAM-based technology
can help in addressing the existing memory hierarchy bottlenecks in database operations.
CPU speed has been increasing at a much faster pace
than memory access speed over the past decades. Therefore, memory access has become a performance bottleneck
for many applications. To overcome this bottleneck, fast
memories called caches have been inserted into the memory hierarchy between the CPU and main memory. But
recent research on DBMS performance [2, 12, 4] shows that
typical database applications suffer from bad cache performance. Caches exploit the principle of temporal and spatial locality by caching data that has recently been accessed
and by prefetching blocks of data. Since many database
applications have poor data locality, prefetching often pol∗This research has been supported by the NSF grants IIS02-20152, CNF 04-23336 and EIA 00-80134.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
Proceedings of the First International Workshop on Data Management on
New Hardware (DaMoN 2005); June 12, 2005, Baltimore, Maryland, USA.
Copyright 2005 ACM X-XXXXX-XX-X/XX/XX ...$5.00.
lutes precious cache space with irrelevant data. As a consequence, changes in data layout1 and cache-conscious algorithms2 have been proposed. Test results show that cache
performance can be improved by applying these techniques.
However, the gap between CPU speed and memory access
speed is growing larger each year. Therefore we postulate a
change at the hardware level in addition to software techniques.
While the memory hierarchy architecture of a conventional CPU has not changed significantly over the past decade,
the changes are very evident in the case of network router
architectures. NPUs, which form the core of the network
routers, now interface with a Ternary Content Addressable
Memory (TCAM or just a CAM) in a similar fashion as
a conventional processor interfaces with a DRAM. TCAMs
are a new family of memories that, in addition to Read and
Write, offer constant-time “lookups” over large amounts of
data. TCAMs have been successfully deployed in sophisticated routers primarily for their very fast lookup speeds
with the leading solutions providing speeds as high as 133
Million lookups per second. Research efforts in the networking arena show TCAMs being increasingly used to provide
a wide range of high speed networking solutions such as
packet classification, intrusion detection and patternmatching. Although networking and database applications span different domains, the very core of many of these
applications involves “looking-up” for a particular data entry among a huge data collection. In this paper, we initiate
our investigation into whether Content Addressable Memories can alleviate the memory-access latency problems which
persist in the database operations.
We would like to point that database researchers, during 1970s, have worked toward using CAMs for accelerating
database performance . But these devices were not widely
accepted for several reasons such as the small size (typically
few hundred words) and the high expense of the devices .
As the silicon became cheaper and with the increasing demand from the network industry, industry standard TCAMs
can now have a capacity of few megabytes at nominal costs.
Further, the earlier CAMs were binary and did not support
Examples are: alignment with cache lines extraction of
relevant data , cache-conscious redesign of pointer-based
data structures , compression of tuples and storage of
tables as single-row, one-attribute tables .
Examples are: radix-join , prefetching-based hash
join  or algorithms based on blocking, loop fusion and
preliminary extraction .
Figure 1: Search engine cells
storage of ternary words which enable most of the fast networking applications.
Section 2 describes the architecture and functionality of a
TCAM, which forms the core of our architecture called the
CAM-Cache. In Section 3, we present an overview of CAMCache and present the technical details of our prototype
for CAM-Cache. Using the new architecture, we propose a
simple equi-join algorithm, integrate this technique into a
commercial DBMS, and compare its performance with the
software-based solutions provided by this database. In Section 5, we present the evaluation of our technique using our
hardware prototype. We conclude with a description of our
ongoing work of using the architecture for other database
problems and other applications.
A Ternary Content Addressable Memory (TCAM), also
known as a Network Search Engine (NSE), provides Read
/ Write / constant-time Search operations over a large array of data. An NSE basically stores an array of words
and given a query word, it compares the query word with
each of the data entries in the array in parallel in a SIMD
fashion and the first word from the top of the array which
matches the query word is given as the output. A TCAM is
logically very similar to the Translational Look-aside Buffer
(TLB) used to provide fast virtual-memory mapping in almost all architectures. Unlike a TLB which is very small
in size (64 entries), the industry standard TCAMs contain
256K entries. TCAMs can also be cascaded together to provide much larger storage and searches over millions of entries
without incurring any loss in the search through-put.
We now describe the architecture of a TCAM in detail. A
search engine device is logically organized as two-dimensional
array of cells as shown in Figure 1. The data is stored one
entry per row, thus forming a table-like database. Each
cell stores one data bit along with another bit called the
mask bit. The collection of the data entries and the mask
entries are called the Data and the Mask arrays respectively.
The mask bit determines how the value in the data-bit is
used when a search command is issued over the NSE. If the
mask bit is set to 1, the corresponding data bit will be
used in comparison with the corresponding data-bit in the
query word. Other wise, the data bit is not used in the
comparison and the result of the comparison for that particular bit would be 1 (a match). A search operation will
succeed only when all the non-masked data bits match the
corresponding words in the search word. Upon a successful
match, the NSE drives the matching address on its output
lines. If there is more than one hit, the MULTI HIT is as-
The purpose of having mask words is to accomplish the
goal of providing constant-time range matching which is
widely used in IP-forwarding rules. For example, the range
128.111.45.* can be represented by loading the data word
with 184.108.40.206 and the corresponding mask word with
255.255.255.0 (i.e. all 1s except for the last 8 bits which
are set to 0) so that if a query word such as 220.127.116.11
is issued, it matches this data word as all the insignificant
mask bits in the mask word are 0 (don’t care) thereby resulting in the successful match. Apart from the data and
the mask-arrays, a select command to the TCAM also involves another mask register called the Global Mask Register
(GMR). Unlike the mask array, where a mask word has its
masking effect on the corresponding data word, the GMR
effects the search on all the words in the data array. The
purpose of GMR is to enable the search on only particular
data bits. This feature is helpful when the data entry stores
more than one search keys and a search query can decide on
which key to use by masking out the unnecessary bits thus
enabling efficient usage of the TCAM storage space.
Our approach, the CAM-Cache architecture, proposes changes
at both hardware and software levels. CAM-Cache extends
the typical memory hierarchy (disk, memory and caches)
and provides Read, Write and fast constant-time Search
functionality to a software application (e.g. database). At
the software level, we concentrate on developing algorithms
which build on the new hardware functions provided by
CAM-cache. We also integrate the new functionality into
database systems in the form of plug-ins3 . In this section, we
describe the data-model and discuss how CAM-cache provides this functionality. We conclude by a description of the
technical details of our prototype.
CAM-Cache Data Model
Figure 2: CAM-Cache cache-line
The proposed CAM-Cache architecture provides a dataabstraction which exploits the underlying SIMD computation power of a TCAM. A CAM-Cache cache-line, the basic
data unit in the CAM-cache architecture, is different from
a regular L1/L2 cache-line. It consists of two parts: an
extracted-data block and an associative pointer block (See
Figure 2). Data relevant to a database query (i.e., query
column) is stored in a TCAM and is called the extracteddata block. A pointer to the actual memory location of
the data (i.e., record identifier) is stored in the associative
Data-cartridges for Oracle 9i and Data-Blades for DB2.
pointer block (an SRAM) of the same cache-line. Data is
aligned with cache-lines and stored in tables row by row.
Data is thus not stored just as bits and bytes as in RAM
but the actual structure of the data is preserved. Unlike
the L1/L2 cache-lines, the database has control over what
resides in the CAM-cache and how it is stored. Data can
be written to and read from the cache as either a single
cache-line or bursts. Additionally, data can be searched by
content: The constant-time Search operation looks up an argument by content and returns the memory pointer stored
in the associative pointer block of the cache-line where a hit
For example, if we want to retrieve the details of all employees, from a table EMP, whose name is M.Smith. The
SQL statement would be SELECT * FROM EMP WHERE
ENAME=M.Smith. This can be answered using CAM-Cache
in the following way. Load all ENAME column values from
the EMP table into the TCAM and the associated record
identifiers (RIDs) into the SRAM. The query result can be
retrieved by issuing a Search command on the TCAM which
returns the matching results. The details of how each component of CAM-Cache works is given in the following subsections.
CAM-Cache is a novel cache connected to the CPU over
the system bus and consists of three hardware subunits: A
simple FPGA4 also called a bridge processor, an array of
search engine chips and an array of SRAM chips which store
associative data. We note that this architecture involving
an array of search engine chips and SRAM memory chips is
well established and used to provide a variety of networking
solutions[8, 6, 7]. The only difference is that in the case of
a network router, there is no necessity for an FPGA as the
NPU communicates with the CAM directly just as a CPU
communicates with the DRAM. We now give a description of
the role of each subunit and the interactions between them.
Figure 3 shows the block diagram of CAM-Cache.
The Bridge Processor
The bridge processor (FPGA) acts as a bridge between
the CPU on one side and a search engine and SRAM on the
other. It takes the Read/Write/Search commands issued
by the CPU and delivers them to the CAM for processing. In an ideal setup, CPU would directly communicate
with the TCAMs over the system bus in the same way as
an NPU communicates with a CPU. However, the current
CPU architectures allow new devices to be integrated only
through a set of standard interfaces (e.g., PCI) and not directly through the system bus. Henceforth, we need a device
such as the bridge processor which understands this standard interface and interacts with TCAMs. Our research
envisions that, in the future, the TCAMs will interact directly with a CPU over the system bus in the same way as
standard DRAM chips currently interface with a CPU.
This is a reasonable assumption considering how the Graphics Processing Unit (GPU) interface with a CPU evolved
over the last decade. GPUs initially shared the standard
I/O bus along with the other devices such as the hard disk,
network interface card etc. Due to its high I/O requirements, a GPU now interface directly with the CPU through
the system bus. This is made possible through a memory
Figure 3: CAM-Cache architecture
Field Programmable Gate Array
hub , found on any standard chipset, which allows the
main memory, the GPU and the standard I/O hub to access
the system bus with out any changes to the CPU design.
TCAM and SRAM
The TCAM interfaces with the FPGA for recieving and
driving results for the read/write/search commands. In the
case of a Read command, it takes the address as input from
the FPGA and drives the output on its output data lines.
Similarly for a Write command, it takes the input of the
address and the data on its respective input lines and drives
an acknowledgment after the command is processed. For the
Search command, it takes the search key on its input data
lines, processes the search command. If the search succeeds,
it drives the address of the first successful match into the
SRAM address lines which in turn returns the corresponding
data to the FPGA.
Using the Bridge Processor, TCAMs and SRAMs, we built
a prototype for CAM-Cache in order to show the feasibility of such an architecture 5 . In this implementation, the
FPGA interfaces with the CPU through a 32-bit PCI interface running at 33MHz. The FPGA on the other side
interfaces with an array of CYNSE10512 TCAMs  and industry standard SRAMs. The interface between the FPGA
and the NSE/SRAM array operates at 133MHz. The search
engine array in the prototype has only one search engine
(CYNSE10512). While the CYNSE10512 has a capacity of
256K entries of 72-bits each and a search throughput of 133
Million searches per second, the capacity of the search subsystem can be easily extended to 18M entries by cascading
32 of these TCAMs without incurring any decline in the
In this prototype, the CPU communicates with the NSE
array through a combination of bandwidth-limited PCI bus
The CAM-Cache prototype development was done at Cypress Semiconductor Inc, San Jose.
Algorithm 1 Nested-loop join
for each tuple s in S /* outer loop */ do
for each tuple r in R /* inner loop*/ do
test whether r and s join to make a tuple.
and a very naive FPGA implementation, thereby incurring
significant overhead for CAM-Cache reads/writes/searches.
Performance evaluation of the prototype shows a peak throughput of 33000 CAM-Cache searches per second. However
in a network router architecture, where the search engines
interface directly with the network processor, 100 million
searches per second is a realized standard . The performance of CAM-Cache can be significantly improved using a
standard 64-bit PCI-X bus which operates at 533MHz. we
are currently working towards improving the kernel driver
performance for the CAM-Cache.
In addition to the hardware level contributions, the main
focus of our research is to build database functionality on top
of the hardware functions provided by CAM-Cache. This involves developing novel CAM-Cache based algorithms for
complex database problems and evaluating them against
software-based solutions. In this paper, we propose a CAMCache based technique for the database equi-join problem
and present an evaluation of this algorithm. Building upon
our earlier framework for integration of novel techniques into
a commercial database , we integrate the CAM-Cachebased equi-join into a popular commercial DBMS and evaluate its performance with the software solutions provided
by the DBMS.
We now describe the equi-join algorithm developed over
CAM-Cache. We modified the simple nested-loop join (NLJ)
algorithm by replacing the expensive and cache-inefficient
inner loop with a CAM-Cache-lookup. The basic algorithms
are shown in Algorithms 1 and 2. The CAM-Cache-based
Join (CJ) algorithm assumes that the join attributes of R are
stored in CAM-Cache as extracted data. In the case when
the size of R is more than the TCAM space, we partition the
relation of R into blocks which fit into the TCAM and apply
the CJ algorithm with each of these blocks and the relation
We have chosen NLJ for the following reasons: The joinphase6 of NLJ does not depend on the join selectivity and
the algorithm performs well for high join selectivity. NLJ
does not take advantage of table organization or additional
data structures and thus represents raw cost. Moreover,
the iteration in the inner loop, which we replace by a hardware lookup, is a basic operation encountered in many other
database operations. But from a join performance point of
view, comparison with the NLJ algorithm is not sufficient
enough : we follow it by experiments evaluating the CJ with
more sophisticated join algorithms, like indexed NLJ, sortmerge join and hash-join.
The CYNSE10512 chip only supports single-match: In
each Search only one associative pointer is returned (even
The join-phase will be discussed in Section 4.3.
Algorithm 2 CAM-Cache-based join
for each tuple s in S do
lookup s in R in the CAM-Cache
if there are multiple hits). Many database operations require multiple-match. In a join, for instance, a tuple of the
outer relation usually matches several tuples of the inner relation. We have implemented a CJ algorithm that supports
multiple-match at the software level. However, this is very
expensive and not the preferred method. But at the current
stage, this represents a reasonable feasible option.
Algorithm 3 CAM-Cache-based join
for each tuple s in S do
Test for match (SSF)
while MULTI-HIT do
Test for match (SSF)
The algorithm is shown in Algorithm 3. After each CAMCache Search operation, the SSF (search successful flag)
is read to see whether a match occurred. Likewise, the
MULTI_HIT flag indicates multiple hits. If it is set, the SSR
(search successful register), which stores the location of the
current hit, is used to tag the search engine location of the
hit entry. The location has to be tagged to make sure that
the next Search operation returns the next match and not
the same one again. The inner while-loop is executed until
MULTI_HIT indicates no more matches. Then all the tags
have to be removed.
We integrated the NLJ algorithm into a commercial database
in order to evaluate its performance against high-performance
software-based solutions provided by this database. We use
the database extensibility techniques offered by most commercial databases in order to make our NLJ algorithm a part
of the database query processing techniques. As a result of
this integration, we were able to compare the performance
of our technique against very efficient software-based techniques such as nested loop join, nested-indexed join, hash
join and sort-merge joins.
Execution of a Join Operation
We split the join operation Q=R 1 S into two phases: In
the first phase, which we call the join phase, the actual join
is performed, i.e. join attributes of one table are matched
against tuples from the other table. The result of the join
phase is a two-column table: For each matching tuple of
join attributes, a pointer to the R-tuple is stored in the first
column and a pointer to the S-tuple is stored in the corresponding second column. In the second phase, which we
refer to as the construction phase, the actual result table Q
is constructed and returned to the user. Most research efforts concentrate on improving the join phase. In some cases
the construction phase can actually be omitted. Whether it
is favorable to perform the construction phase depends on
the operation that follows the join operation in the query
There is no general rule or estimate of how expensive the
construction phase is compared to the join phase, since the
time spent on the construction phase depends on the join
selectivity. The join selectivity is the ratio of the size (number of tuples) of the join file to the size of the Cartesian
product file: js = |(R 1 S)|/|(R × S)| = |(R 1 S)|/(|R| ∗ |S|)
Therefore 0 ≤ js ≤ 1 holds.
Given the join selectivity, the number of tuples accessed
can be estimated. For instance, for NLJ, the number of
tuples accessed, t, would be t = |R| ∗ |S| + (js ∗ |R| ∗ |S|).
The join phase costs |R| ∗ |S| accesses and the construction
phase (js ∗ |R| ∗ |S|) accesses. In terms of tuple accesses, the
cost of the construction phase in NLJ is lesser than or equal
to the cost of the join phase. But since key attributes are
often the join attributes, the cost of the construction phase
is typically linear in the size of one of the tables.
Sometimes it is easier to use what we call the join factor.
The join factor jf is the average number of tuples from S
that a tuple from R matches. If |R| = |S| then jf = |R| ∗ js.
In this paper, we use the join factor as a parameter for
measuring the performance of joins.
We conducted the experiments on a single CPU Intel Pentium IV 2.4GHz PC (). The system bus speed is 533MHz.
The CAM-Cache prototype interfaces with the CPU through
a 32-bit 33MHz PCI interface. For the purpose of accessing
the CAM-Cache prototype, drivers were developed on the
Linux platform. The CJ algorithm is implemented in C and
compiled using the GNU C compiler. The experiments are
evaluated on the prototype described in Section 3.2.3. In the
first set of experiments, we show the results of comparison
between CJ and software NLJ with different parameters. We
follow-up these experiments with a comparison with other
We present our results in two forms: (a) We report the
performance results of the CJ algorithm as is from the current prototype (b) We also report the results of the CJ algorithm with the PCI overhead deducted from the results
reported in (a). We show the results of case (b) because the
current prototype suffers from excessive PCI overhead and
this could have been avoided had the TCAMs interfaced directly with the CPU as they do with NPUs. We refer to the
performance of the CJ algorithm in case(a) as CJ and the
results in case(b) as CJnoPCI.
The data sets are randomly generated in the following
manner: Each record consists of a join attribute (1 word)
and (record-width−1) words of random data to simulate
cache pollution. The join attribute is a uniformly distributed
random number in a range that determines the join selectivity, i.e. how many tuples on the average have the same
value. All experiments are performed over two tables that
have the same number of entries.
We now present a preliminary evaluation of the CJ algorithm. In this experiment, we assume a fixed join factor and
vary the table size from 10K entries to 100K entries. The
two tables have the same size for all experiments. 256K is
the maximum number of entries of a single CYNSE10512
chip. Figure 4 shows the results for join factor 1 (i.e. the
size of the result-table |Q|, with Q = R 1 S equals |R|, with
|R| = |S|). In the Figure 4(a), we show the actual performance returned by the CJ algorithm when executed on the
prototype and compare it with a software nested-loop algorithm. These results show the advantage of constant-time
lookups. With a join factor of 1 and a table size of 100K,
CJ with a query elapse time of 7.02 seconds on the prototype is 110 times faster than the software-based NLJ which
took 741 seconds. It should be noted that CJ outperforms
the software NLJ despite the PCI overhead incurred in the
execution of CJ.
In Figure 4(b), we show the estimated performance of the
algorithm with the PCI overhead conservatively discounted
and compare its performance with popular join techniques
such as the hash join, sort-merge join and the nested indexed join. These results show that, at a join factor of 1,
the CJnoPCI algorithm performs as well if not better than
the other techniques. Note the difference in the y-axis ranges
of the Figures 4(a) and (b). The differences in the performance of CJ (7.02 seconds) and CJnoPCI (0.183 seconds)
algorithms on a table size of 100K tuples(see Figures 4(a)
and (b)) show that the current CJ algorithm incurs excessive
PCI overhead which dominates the cost of the CJ algorithm.
It should be noted that we deducted the PCI overhead very
conservatively which means that the true performance of
the CJ algorithm on a network processor will be much better than the CJnoPCI. The results of the same experiment
with a join factor of 12 (|Q| = 12∗|R|) are shown in Figure 5. With a table size of 100K, the difference in the elapsed
time of CJnoPCI join in Figures 5(b) and 4(b) shows that
the performance of CJnoPCI degrades with increasing join
factor. This is because the while-loop for the mult-match
involved in the join phase is very expensive and its cost increases linearly with the join factor.
Analysis of Results
On the whole, these experiments show that the CJ algorithm can perform very well when compared with popular
join techniques. However, we consider these experiments to
be preliminary as we have not compared the performance of
CJ with that of other modern techniques [13, 12] for equijoin. We also need to explore the conditions under which
CJ performs better than the other techniques. However,
it should be noted that CJ is a simple technique which
is based on a most simple look-up feature of the TCAM.
There are still a lot of features such as masking and rangematching, which enable many novel networking solutions [7,
8, 6] and are yet to be exploited for solving database problems. This paper basically builds the infrastructure for exploring how the TCAM technology can help solving complex database operations and other data-oriented applications such as data-streams.
CONCLUSION AND FUTURE WORK
In this paper, we make an attempt to address the problem
of “Where does the time go in a DBMS?”  by proposing changes at both hardware and software levels. We propose to exploit the Content Addressable Memory technology, used widely in network router architectures, in order to
alleviate the bottlenecks in current processor architectures.
We proposed a novel architecture called CAM-Cache, which
10000 20000 30000 40000 50000 60000 70000 80000 90000 100000
10000 20000 30000 40000 50000 60000 70000 80000 90000 100000
Elapsed time (seconds)
Elapsed time (seconds)
Elapsed time (seconds)
Elapsed time (seconds)
10000 20000 30000 40000 50000 60000 70000 80000 90000 100000
10000 20000 30000 40000 50000 60000 70000 80000 90000 100000
Figure 4: Running-times with varying table sizes :
Figure 5: Running-times with varying table sizes :
integrates CAM into the caching hierarchy of a conventional
processor. We discuss various data layout schemes provided
by this architecture and how it enables new database algorithms. Using this architecture, we proposed a very simple
hardware-based equi-join algorithm by modifying the inner
iteration of a software nested loop join. In order to show the
feasibility of building the proposed architecture, we built a
prototype of CAM-Cache using custom hardware. We evaluated this architecture by developing the hardware-based
join on the prototype and comparing its performance with
various software-based join techniques. Through our preliminary performance analysis, we showed that despite the
various overheads of the current prototype, this architecture has the potential for providing novel efficient solutions
for existing problems. In this paper, we presented a simple technique based on a very basic feature of TCAM while
many other features such as range-matching are yet to be
exploited. Using this setup, we will explore the possibilities
of providing alternative solutions for database join and other
complex database problems. We will also explore the application of this setup in other domains such as data streams.
 Intel 915x Chipset Architecture.
 A.G.Ailamaki, D.J.DeWitt, M.D.Hill, and D.A.Wood.
DBMSs On a Modern Processor: Where Does Time
Go? In Procedings of the 25th International
Conference on Very Large Data Bases, pages 266–277,
 Anderson.G.A. and Kain.R.Y. ’A content-addressed
memory design for data base applications’. In IEEE
International Conference on Parallel Processing, pages
A.Shatdal, C.Kant, and J.Naughton. Cache Conscious
Algorithms for Relational Query Processing. In
VLDB, pages 510–512, 1994.
Cypress Semiconductor Corporation. CYNSE10512
Network Search Engine.
F.Yu, R.H.Katz, and T.V.Lakshman. Gigabit Rate
Packet Pattern-Matching Using TCAM. In IEEE
R. F.Yu. Efficient Multi-Match Packet Classification
with TCAM. In IEEE Hot Interconnects 2004, 2004.
G.J.Narlikar, A.Basu, and F.Zane. Coolcams:
Power-efficient tcams for forwarding engines. In
Intel Corporation. Intel Pentium IV Processor.
K.E.Grosspietsch. Associative processors and
memories: A survey. IEEE Micro, 12(3), 1992.
N.Bandi, C.Sun, D.Agrawal, and A.El Abbadi.
Hardware Acceleration in Commercial Databases: A
Case Study of Spatial Operations. In VLDB, 2004.
P.Boncz, S.Manegold, and M.L.Kersten. Database
Architecture Optimized for the New Bottleneck:
Memory Access. In International Conference on Very
Large Data Bases, pages 266–277, 1999.
S.Chen, A.Ailamaki, P.B.Gibbons, and T.C.Mowry.
Improving Hash Join Performance through
Prefetching. In ICDE, pages 116–127, 2004.
T.M.Chilimbi, M.D.Hill, and J.R.Larus. Cache -
Conscious Structure Layout. In Proceedings of ACM
SIGPLAN 1999, pages 13–24, 1999.