Patent application number | Description | Published |
20090049286 | DATA PROCESSING SYSTEM, PROCESSOR AND METHOD OF DATA PROCESSING HAVING IMPROVED BRANCH TARGET ADDRESS CACHE - A processor includes an execution unit and instruction sequencing logic that fetches instructions from a memory system for execution. The instruction sequencing logic includes branch logic that outputs predicted branch target addresses for use as instruction fetch addresses. The branch logic includes a level one branch target address cache (BTAC) and a level two BTAC each having a respective plurality of entries each associating at least a tag with a predicted branch target address. The branch logic accesses the level one and level two BTACs in parallel with a tag portion of a first instruction fetch address to obtain a first predicted branch target address from the level one BTAC for use as a second instruction fetch address in a first processor clock cycle and a second predicted branch target address from the level two BTAC for use as a third instruction fetch address in a later second processor clock cycle. | 02-19-2009 |
20090063811 | System for Data Processing Using a Multi-Tiered Full-Graph Interconnect Architecture - A system is provided for implementing a multi-tiered full-graph interconnect architecture. In order to implement a multi-tiered full-graph interconnect architecture, a plurality of processors are coupled to one another to create a plurality of processor books. The plurality of processor books are coupled together to create a plurality of supernodes. Then, the plurality of supernodes are coupled together to create the multi-tiered full-graph interconnect architecture. Data is then transmitted from one processor to another within the multi-tiered full-graph interconnect architecture based on an addressing scheme that specifies at least a supernode and a processor book associated with a target processor to which the data is to be transmitted. | 03-05-2009 |
20090063814 | System and Method for Routing Information Through a Data Processing System Implementing a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for routing information through the data processing system. Data is received at a source processor within a set of processors that is to be transmitted to a destination processor, where the data includes address information. A first determination is performed as to whether the destination processor is within a same processor book as the source processor based on the address information. A second determination is performed as to whether the destination processor is within a same supernode as the source processor based on the address information if the destination processor is not within the same processor book. A routing path is identified for the data based on results of the first determination, the second determination, and one or more routing table data structures. The data is then transmitted from the source processor along the identified routing path toward the destination processor. | 03-05-2009 |
20090063815 | System and Method for Providing Full Hardware Support of Collective Operations in a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for performing collective operations. In hardware of a parent processor in a first processor book, a number of other processors are determined in a same or different processor book of the data processing system that is needed to execute the collective operation, thereby establishing a plurality of processors comprising the parent processor and the other processors. In hardware of the parent processor, the plurality of processors are logically arranged as a plurality of nodes in a hierarchical structure. The collective operation is transmitted to the plurality of processors based on the hierarchical structure. In hardware of the parent processor, results are received from the execution of the collective operation from the other processors, a final result is generated of the collective operation based on the received results, and the final result is output. | 03-05-2009 |
20090063816 | System and Method for Performing Collective Operations Using Software Setup and Partial Software Execution at Leaf Nodes in a Multi-Tiered Full-Graph Interconnect Architecture - A method, computer program product, and system are provided for performing collective operations. In software executing on a parent processor in a first processor book, a number of other processors are determined in a same or different processor book of the data processing system that is needed to execute the collective operation, thereby establishing a plurality of processors comprising the parent processor and the other processors. In software executing on the parent processor, the plurality of processors are logically arranged as a plurality of nodes in a hierarchical structure. The collective operation is transmitted to the plurality of processors based on the hierarchical structure. In hardware of the parent processor, results are received from the execution of the collective operation from the other processors, a final result is generated of the collective operation based on the received results, and the final result is output. | 03-05-2009 |
20090063885 | System and Computer Program Product for Modifying an Operation of One or More Processors Executing Message Passing Interface Tasks - A system and computer program product for modifying an operation of one or more processors executing message passing interface (MPI) tasks are provided. Mechanisms for adjusting the balance of processing workloads of the processors are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. Each processor has an associated hardware implemented MPI load balancing controller. The MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. As a result, operations may be performed to shift workloads from the slowest processor to one or more of the faster processors. | 03-05-2009 |
20090064139 | Method for Data Processing Using a Multi-Tiered Full-Graph Interconnect Architecture - A method is provided for implementing a multi-tiered full-graph interconnect architecture. In order to implement a multi-tiered full-graph interconnect architecture, a plurality of processors are coupled to one another to create a plurality of processor books. The plurality of processor books are coupled together to create a plurality of supernodes. Then, the plurality of supernodes are coupled together to create the multi-tiered full-graph interconnect architecture. Data is then transmitted from one processor to another within the multi-tiered full-graph interconnect architecture based on an addressing scheme that specifies at least a supernode and a processor book associated with a target processor to which the data is to be transmitted. | 03-05-2009 |
20090064165 | Method for Hardware Based Dynamic Load Balancing of Message Passing Interface Tasks - A method for providing hardware based dynamic load balancing of message passing interface (MPI) tasks are provided. Mechanisms for adjusting the balance of processing workloads of the processors executing tasks of an MPI job are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. Each processor has an associated hardware implemented MPI load balancing controller. The MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. As a result, operations may be performed to shift workloads from the slowest processor to one or more of the faster processors. | 03-05-2009 |
20090064166 | System and Method for Hardware Based Dynamic Load Balancing of Message Passing Interface Tasks - A system and method for providing hardware based dynamic load balancing of message passing interface (MPI) tasks are provided. Mechanisms for adjusting the balance of processing workloads of the processors executing tasks of an MPI job are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. Each processor has an associated hardware implemented MPI load balancing controller. The MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. As a result, operations may be performed to shift workloads from the slowest processor to one or more of the faster processors. | 03-05-2009 |
20090064167 | System and Method for Performing Setup Operations for Receiving Different Amounts of Data While Processors are Performing Message Passing Interface Tasks - A system and method are provided for performing setup operations for receiving a different amount of data while processors are performing message passing interface (MPI) tasks. Mechanisms for adjusting the balance of processing workloads of the processors are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. An MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. As a result, setup operations may be performed while processors are performing MPI tasks to prepare for receiving different sized portions of data in a subsequent computation cycle based on the history. | 03-05-2009 |
20090064168 | System and Method for Hardware Based Dynamic Load Balancing of Message Passing Interface Tasks By Modifying Tasks - A system and method are provided for providing hardware based dynamic load balancing of message passing interface (MPI) tasks by modifying tasks. Mechanisms for adjusting the balance of processing workloads of the processors executing tasks of an MPI job are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. Each processor has an associated hardware implemented MPI load balancing controller. The MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. Thus, operations may be performed to shift workloads from the slowest processor to one or more of the faster processors. | 03-05-2009 |
20090077354 | Techniques for Predicated Execution in an Out-of-Order Processor - A technique for handling predicated code in an out-of-order processor includes detecting a predicate defining instruction associated with a predicated code region. Renaming of predicated instructions, within the predicated code region, is then stalled until a predicate of the predicate defining instruction is resolved. | 03-19-2009 |
20090113164 | Method, System and Program Product for Address Translation Through an Intermediate Address Space - In a data processing system capable of concurrently executing multiple hardware threads of execution, an intermediate address translation unit in a processing unit translates an effective address for a memory access into an intermediate address. A cache memory is accessed utilizing the intermediate address. In response to a miss in cache memory, the intermediate address is translated into a real address by a real address translation unit that performs address translation for multiple hardware threads of execution. The system memory is accessed with the real address. | 04-30-2009 |
20090132767 | COMPLIER ASSISTED VICTIM CACHE BYPASSING - A method for compiler assisted victim cache bypassing including: identifying a cache line as a candidate for victim cache bypassing; conveying a bypassing-the-victim-cache information to a hardware; and checking a state of the cache line to determine a modified state of the cache line, wherein the cache line is identified for cache bypassing if the cache line that has no reuse within a loop or loop nest and there is no immediate loop reuse or there is a substantial across loop reuse distance so that it will be replaced from both main and victim cache before being reused. | 05-21-2009 |
20090198903 | DATA PROCESSING SYSTEM, PROCESSOR AND METHOD THAT VARY AN AMOUNT OF DATA RETRIEVED FROM MEMORY BASED UPON A HINT - In at least one embodiment, a processor detects during execution of program code whether a load instruction within the program code is associated with a hint. In response to detecting that the load instruction is not associated with a hint, the processor retrieves a full cache line of data from the memory hierarchy into the processor in response to the load instruction. In response to detecting that the load instruction is associated with a hint, a processor retrieves a partial cache line of data into the processor from the memory hierarchy in response to the load instruction. | 08-06-2009 |
20090198904 | Techniques for Data Prefetching Using Indirect Addressing with Offset - A technique for performing data prefetching using indirect addressing includes determining a first memory address of a pointer associated with a data prefetch instruction. Content, that is included in a first data block (e.g., a first cache line) of a memory, at the first memory address is then fetched. An offset is then added to the content of the memory at the first memory address to provide a first offset memory address. A second memory address is then determined based on the first offset memory address. A second data block (e.g., a second cache line) that includes data at the second memory address is then fetched (e.g., from the memory or another memory). A data prefetch instruction may be indicated by a unique operational code (opcode), a unique extended opcode, or a field (including one or more bits) in an instruction. | 08-06-2009 |
20090198905 | Techniques for Prediction-Based Indirect Data Prefetching - A technique for data prefetching using indirect addressing includes monitoring data pointer values, associated with an array, in an access stream to a memory. The technique determines whether a pattern exists in the data pointer values. A prefetch table is then populated with respective entries that correspond to respective array address/data pointer pairs based on a predicted pattern in the data pointer values. Respective data blocks (e.g., respective cache lines) are then prefetched (e.g., from the memory or another memory) based on the respective entries in the prefetch table. | 08-06-2009 |
20090198906 | Techniques for Multi-Level Indirect Data Prefetching - A technique for performing data prefetching using multi-level indirect data prefetching includes determining a first memory address of a pointer associated with a data prefetch instruction. Content that is included in a first data block (e.g., a first cache line of a memory) at the first memory address is then fetched. A second memory address is then determined based on the content at the first memory address. Content that is included in a second data block (e.g., a second cache line) at the second memory address is then fetched (e.g., from the memory or another memory). A third memory address is then determined based on the content at the second memory address. Finally, a third data block (e.g., a third cache line) that includes another pointer or data at the third memory address is fetched (e.g., from the memory or the another memory). | 08-06-2009 |
20090198907 | Dynamic Adjustment of Prefetch Stream Priority - A method, processor, and data processing system for dynamically adjusting a prefetch stream priority based on the consumption rate of the data by the processor. The method includes a prefetch engine issuing a prefetch request of a first prefetch stream to fetch one or more data from the memory subsystem. The first prefetch stream has a first assigned priority that determines a relative order for scheduling prefetch requests of the first prefetch stream relative to other prefetch requests of other prefetch streams. Based on the receipt of a processor demand for the data before the data returns to the cache or return of the data along time before the receiving the processor demand, logic of the prefetch engine dynamically changes the first assigned priority to a second higher or lower priority, which priority is subsequently utilized to schedule and issue a next prefetch request of the first prefetch stream. | 08-06-2009 |
20090198909 | Jump Starting Prefetch Streams Across Page Boundaries - A method, processor, and data processing system for enabling utilization of a single prefetch stream to access data across a memory page boundary. A prefetch engine includes an active streams table in which information for one or more scheduled prefetch streams are stored. The prefetch engine also includes a victim table for storing a previously active stream whose next prefetch crosses a memory page boundary. The scheduling logic issues a prefetch request with a real address to fetch data from the lower level memory. Then, responsive to detecting that the real address of the stream's next sequential prefetch crosses the memory page boundary, the prefetch engine determines when the first prefetch stream can continue across the page boundary of the first memory page (via an effective address comparison). The PE automatically reinserts the first prefetch stream into the active stream table to jump start prefetching across the page boundary. | 08-06-2009 |
20090198910 | DATA PROCESSING SYSTEM, PROCESSOR AND METHOD THAT SUPPORT A TOUCH OF A PARTIAL CACHE LINE OF DATA - According to method of data processing in a multiprocessor data processing system, in response to a processor touch request targeting a target granule of a cache line of data containing multiple granules, a processing unit originates on an interconnect of the multiprocessor data processing system a partial touch request that requests a copy of only the target granule for subsequent query access. In response to a combined response to the partial touch request indicating success, the combined response representing a system-wide response to the partial touch request, the processing unit receives the target granule of the target cache line and updates a coherency state of the target granule while retaining a coherency state of at least one other granule of the cache line. | 08-06-2009 |
20090198948 | Techniques for Data Prefetching Using Indirect Addressing - A technique for performing indirect data prefetching includes determining a first memory address of a pointer associated with a data prefetch instruction. Content of a memory at the first memory address is then fetched. A second memory address is determined from the content of the memory at the first memory address. Finally, a data block (e.g., a cache line) including data at the second memory address is fetched (e.g., from the memory or another memory). | 08-06-2009 |
20090198950 | Techniques for Indirect Data Prefetching - A processor includes a first address translation engine, a second address translation engine, and a prefetch engine. The first address translation engine is configured to determine a first memory address of a pointer associated with a data prefetch instruction. The prefetch engine is coupled to the first translation engine and is configured to fetch content, included in a first data block (e.g., a first cache line) of a memory, at the first memory address. The second address translation engine is coupled to the prefetch engine and is configured to determine a second memory address based on the content of the memory at the first memory address. The prefetch engine is also configured to fetch (e.g., from the memory or another memory) a second data block (e.g., a second cache line) that includes data at the second memory address. | 08-06-2009 |
20090198956 | System and Method for Data Processing Using a Low-Cost Two-Tier Full-Graph Interconnect Architecture - A system and method are provided for implementing a two-tier full-graph interconnect architecture. In order to implement a two-tier full-graph interconnect architecture, a plurality of processors are coupled to one another to create a plurality of supernodes. Then, the plurality of supernodes are coupled together to create the two-tier full-graph interconnect architecture. Data is then transmitted from one processor to another within the two-tier full-graph interconnect architecture based on an addressing scheme that specifies at least a supernode and a processor chip identifier associated with a target processor to which the data is to be transmitted. | 08-06-2009 |
20090198965 | METHOD AND SYSTEM FOR SOURCING DIFFERING AMOUNTS OF PREFETCH DATA IN RESPONSE TO DATA PREFETCH REQUESTS - According to a method of data processing, a memory controller receives a prefetch load request from a processor core of a data processing system. The prefetch load request specifies a requested line of data. In response to receipt of the prefetch load request, the memory controller determines by reference to a stream of demand requests how much data is to be supplied to the processor core in response to the prefetch load request. In response to the memory controller determining to provide less than all of the requested line of data, the memory controller provides less than all of the requested line of data to the processor core. | 08-06-2009 |
20100030973 | CACHE DIRECTED SEQUENTIAL PREFETCH - A technique for performing stream detection and prefetching within a cache memory simplifies stream detection and prefetching. A bit in a cache directory or cache entry indicates that a cache line has not been accessed since being prefetched and another bit indicates the direction of a stream associated with the cache line. A next cache line is prefetched when a previously prefetched cache line is accessed, so that the cache always attempts to prefetch one cache line ahead of accesses, in the direction of a detected stream. Stream detection is performed in response to load misses tracked in the load miss queue (LMQ). The LMQ stores an offset indicating a first miss at the offset within a cache line. A next miss to the line sets a direction bit based on the difference between the first and second offsets and causes prefetch of the next line for the stream. | 02-04-2010 |
20100115204 | NON-UNIFORM CACHE ARCHITECTURE (NUCA) - In one embodiment, a cache memory includes a cache array including a plurality of entries for caching cache lines of data, where the plurality of entries are distributed between a first region implemented in a first memory technology and a second region implemented in a second memory technology. The cache memory further includes a cache directory of the contents of the cache array and a cache controller that controls operation of the cache memory. | 05-06-2010 |
20100268788 | Remote Asynchronous Data Mover - A distributed data processing system executes multiple tasks within a parallel job, including a first local task on a local node and at least one task executing on a remote node, with a remote memory having real address (RA) locations mapped to one or more of the source effective addresses (EA) and destination EA of a data move operation initiated by a task executing on the local node. On initiation of the data move operation, remote asynchronous data move (RADM) logic identifies that the operation moves data to/from a first EA that is memory mapped to an RA of the remote memory. The local processor/RADM logic initiates a RADM operation that moves a copy of the data directly from/to the first remote memory by completing the RADM operation using the network interface cards (NICs) of the source and destination processing nodes, determined by accessing a data center for the node IDs of remote memory. | 10-21-2010 |
20100293339 | DATA PROCESSING SYSTEM, PROCESSOR AND METHOD FOR VARYING A DATA PREFETCH SIZE BASED UPON DATA USAGE - A method of data processing in a processor includes maintaining a usage history indicating demand usage of prefetched data retrieved into cache memory. An amount of data to prefetch by a data prefetch request is selected based upon the usage history. The data prefetch request is transmitted to a memory hierarchy to prefetch the selected amount of data into cache memory. | 11-18-2010 |
20110004875 | Method and System for Performance Isolation in Virtualized Environments - A method, a system, an apparatus, and a computer program product for allocating resources of one or more shared devices to one or more partitions of a virtualization environment within a data processing system. At least one user defined resource assignment is received for one or more devices associated with the data processing system. One or more registers, associated with the one or more partitions are dynamically set to execute the at least one resource assignment, whereby the at least one resource assignment enables a user defined quantitative measure (number and/or percentage) of devices to operate when the one or more transactions are executed via the partition. The system enables the one or more devices to execute one or more transactions at a bandwidth/capacity that is less than or equal to the user defined resource assignment and minimizes performance interference among partitions. | 01-06-2011 |
20110022773 | Fine Grained Cache Allocation - A mechanism is provided in a virtual machine monitor for fine grained cache allocation in a shared cache. The mechanism partitions a cache tag into a most significant bit (MSB) portion and a least significant bit (LSB) portion. The MSB portion of the tags is shared among the cache lines in a set. The LSB portion of the tags is private, one per cache line. The mechanism allows software to set the MSB portion of tags in a cache to allocate sets of cache lines. The cache controller determines whether a cache line is locked based on the MSB portion of the tag. | 01-27-2011 |
20110072214 | Read and Write Aware Cache - A mechanism is provided in a cache for providing a read and write aware cache. The mechanism partitions a large cache into a read-often region and a write-often region. The mechanism considers read/write frequency in a non-uniform cache architecture replacement policy. A frequently written cache line is placed in one of the farther banks. A frequently read cache line is placed in one of the closer banks. The size ratio between read-often and write-often regions may be static or dynamic. The boundary between the read-often region and the write-often region may be distinct or fuzzy. | 03-24-2011 |
20110145509 | CACHE DIRECTED SEQUENTIAL PREFETCH - A technique for performing stream detection and prefetching within a cache memory simplifies stream detection and prefetching. A bit in a cache directory or cache entry indicates that a cache line has not been accessed since being prefetched and another bit indicates the direction of a stream associated with the cache line. A next cache line is prefetched when a previously prefetched cache line is accessed, so that the cache always attempts to prefetch one cache line ahead of accesses, in the direction of a detected stream. Stream detection is performed in response to load misses tracked in the load miss queue (LMQ). The LMQ stores an offset indicating a first miss at the offset within a cache line. A next miss to the line sets a direction bit based on the difference between the first and second offsets and causes prefetch of the next line for the stream. | 06-16-2011 |
20110238946 | Data Reorganization through Hardware-Supported Intermediate Addresses - A virtual address scheme for improving performance and efficiency of memory accesses of sparsely-stored data items in a cached memory system is disclosed. In a preferred embodiment of the present invention, a special address translation unit is used to translate sets of non-contiguous addresses in real memory into contiguous blocks of addresses in an “intermediate address space.” This intermediate address space is a fictitious or “virtual” address space, but is distinguishable from the virtual address space visible to application programs, and in user-level memory operations, effective addresses seen/manipulated by application programs are translated into intermediate addresses by an additional address translation unit for memory caching purposes. This scheme allows non-contiguous data items in memory to be assembled into contiguous cache lines for more efficient caching/access (due to the perceived spatial proximity of the data from the perspective of the processor). | 09-29-2011 |
20110292594 | Scalable Space-Optimized and Energy-Efficient Computing System - A scalable space-optimized and energy-efficient computing system is provided. The computing system comprises a plurality of modular compartments in at least one level of a frame configured in a hexadron configuration. The computing system also comprises an air inlet, an air mixing plenum, and at least one fan. In the computing system the plurality of modular compartments are affixed above the air inlet, the air mixing plenum is affixed above the plurality of modular compartments, and the at least one fan is affixed above the air mixing plenum. When at least one module is inserted into one of the plurality of modular compartments, the module couples to a backplane within the frame. | 12-01-2011 |
20110292597 | Stackable Module for Energy-Efficient Computing Systems - A modular processing module is provided. The modular processing module comprises a set of processing module sides. Each processing module side comprises a circuit board, a plurality of connectors coupled to the circuit board, and a plurality of processing nodes coupled to the circuit board. Each processing module side in the set of processing module sides couples to another processing module side using at least one connector in the plurality of connectors such that, when all of the set of processing module sides are coupled together, the modular processing module is formed. The modular processing module comprises an exterior connection to a power source and a communication system. | 12-01-2011 |
20110296097 | Mechanisms for Reducing DRAM Power Consumption - Mechanisms are provided for inhibiting precharging of memory cells of a dynamic random access memory (DRAM) structure. The mechanisms receive a command for accessing memory cells of the DRAM structure. The mechanisms further determine, based on the command, if precharging the memory cells following accessing the memory cells is to be inhibited. Moreover, the mechanisms send, in response to the determination indicating that precharging the memory cells is to be inhibited, a command to blocking logic of the DRAM structure to block precharging of the memory cells following accessing the memory cells. | 12-01-2011 |
20110296107 | Latency-Tolerant 3D On-Chip Memory Organization - A mechanism is provided within a 3D stacked memory organization to spread or stripe cache lines across multiple layers. In an example organization, a 128B cache line takes eight cycles on a 16B-wide bus. Each layer may provide 32B. The first layer uses the first two of the eight transfer cycles to send the first 32B. The next layer sends the next 32B using the next two cycles of the eight transfer cycles, and so forth. The mechanism provides a uniform memory access. | 12-01-2011 |
20110296112 | Reducing Energy Consumption of Set Associative Caches by Reducing Checked Ways of the Set Association - Mechanisms for accessing a set associative cache of a data processing system are provided. A set of cache lines, in the set associative cache, associated with an address of a request are identified. Based on a determined mode of operation for the set, the following may be performed: determining if a cache hit occurs in a preferred cache line without accessing other cache lines in the set of cache lines; retrieving data from the preferred cache line without accessing the other cache lines in the set of cache lines, if it is determined that there is a cache hit in the preferred cache line; and accessing each of the other cache lines in the set of cache lines to determine if there is a cache hit in any of these other cache lines only in response to there being a cache miss in the preferred cache line(s). | 12-01-2011 |
20110296115 | Assigning Memory to On-Chip Coherence Domains - A mechanism is provided for assigning memory to on-chip cache coherence domains. The mechanism assigns caches within a processing unit to coherence domains. The mechanism then assigns chunks of memory to the coherence domains. The mechanism monitors applications running on cores within the processing unit to identify needs of the applications. The mechanism may then reassign memory chunks to the cache coherence domains based on the needs of the applications running in the coherence domains. When a memory controller receives the cache miss, the memory controller may look up the address in a lookup table that maps memory chunks to cache coherence domains. Snoop requests are sent to caches within the coherence domain. If a cache line is found in a cache within the coherence domain, the cache line is returned to the originating cache by the cache containing the cache line either directly or through the memory controller. If a cache line is not found within the coherence domain, the memory controller accesses the memory to retrieve the cache line. | 12-01-2011 |
20110296149 | Instruction Set Architecture Extensions for Performing Power Versus Performance Tradeoffs - Mechanisms are provided for processing an instruction in a processor of a data processing system. The mechanisms operate to receive, in a processor of the data processing system, an instruction, the instruction including power/performance tradeoff information associated with the instruction. The mechanisms further operate to determine power/performance tradeoff priorities or criteria, specifying whether power conservation or performance is prioritized with regard to execution of the instruction, based on the power/performance tradeoff information. Moreover, the mechanisms process the instruction in accordance with the power/performance tradeoff priorities or criteria identified based on the power/performance tradeoff information of the instruction. | 12-01-2011 |
20110296434 | Techniques for Dynamically Sharing a Fabric to Facilitate Off-Chip Communication for Multiple On-Chip Units - A technique for sharing a fabric to facilitate off-chip communication for on-chip units includes dynamically assigning a first unit that implements a first communication protocol to a first portion of the fabric when private fabrics are indicated for the on-chip units. The technique also includes dynamically assigning a second unit that implements a second communication protocol to a second portion of the fabric when the private fabrics are indicated for the on-chip units. In this case, the first and second units are integrated in a same chip and the first and second protocols are different. The technique further includes dynamically assigning, based on off-chip traffic requirements of the first and second units, the first unit or the second unit to the first and second portions of the fabric when the private fabrics are not indicated for the on-chip units. | 12-01-2011 |
20120265944 | Assigning Memory to On-Chip Coherence Domains - A mechanism for assigning memory to on-chip cache coherence domains assigns caches within a processing unit to coherence domains. The mechanism assigns chunks of memory to the coherence domains. The mechanism monitors applications running on cores within the processing unit to identify needs of the applications. The mechanism may then reassign memory chunks to the cache coherence domains based on the needs of the applications running in the coherence domains. When a memory controller receives the cache miss, the memory controller may look up the address in a lookup table that maps memory chunks to cache coherence domains. Snoop requests are sent to caches within the coherence domain. If a cache line is found in a cache within the coherence domain, the cache line is returned to the originating cache by the cache containing the cache line either directly or through the memory controller. | 10-18-2012 |
20120266180 | Performing Setup Operations for Receiving Different Amounts of Data While Processors are Performing Message Passing Interface Tasks - A system and method are provided for performing setup operations for receiving a different amount of data while processors are performing message passing interface (MPI) tasks. Mechanisms for adjusting the balance of processing workloads of the processors are provided so as to minimize wait periods for waiting for all of the processors to call a synchronization operation. An MPI load balancing controller maintains a history that provides a profile of the tasks with regard to their calls to synchronization operations. From this information, it can be determined which processors should have their processing loads lightened and which processors are able to handle additional processing loads without significantly negatively affecting the overall operation of the parallel execution system. As a result, setup operations may be performed while processors are performing MPI tasks to prepare for receiving different sized portions of data in a subsequent computation cycle based on the history. | 10-18-2012 |
20120311265 | Read and Write Aware Cache - A mechanism is provided in a cache for providing a read and write aware cache. The mechanism partitions a large cache into a read-often region and a write-often region. The mechanism considers read/write frequency in a non-uniform cache architecture replacement polity. A frequently written cache line is placed in one of the farther banks. A frequently read cache line is place in one of the closer banks. The size ration between read-often and write-often regions may be static or dynamic. The boundary between the read-often region and the write-often region may be distinct or fuzzy. | 12-06-2012 |