Paul E. Schardt, Rochester US

Paul E. Schardt, Rochester, MN US

Patent application number	Description	Published
20090125574	Software Pipelining On a Network On Chip - Memory sharing in a software pipeline on a network on chip (‘NOC’), the NOC including integrated processor (‘IP’) blocks, routers, memory communications controllers, and network interface controllers, with each IP block adapted to a router through a memory communications controller and a network interface controller, where each memory communications controller controlling communications between an IP block and memory, and each network interface controller controlling inter-IP block communications through routers, including segmenting a computer software application into stages of a software pipeline, the software pipeline comprising one or more paths of execution; allocating memory to be shared among at least two stages including creating a smart pointer, the smart pointer including data elements for determining when the shared memory can be deallocated; determining, in dependence upon the data elements for determining when the shared memory can be deallocated, that the shared memory can be deallocated; and deallocating the shared memory.	05-14-2009
20090125703	Context Switching on a Network On Chip - Data processing on a network on chip (‘NOC’) that includes integrated processor (‘IP’) blocks, routers, memory communications controllers, and network interface controllers, with each IP block adapted to a router through a memory communications controller and a network interface controller, with each IP block also adapted to the network by a low latency, high bandwidth application messaging interconnect comprising an inbox and an outbox, each IP block also including a stack normally used for context switching, the stack access slower than the outbox access, and each IP block further including a processor supporting a plurality of threads of execution, the processor configured to save, upon a context switch, a context of a current thread of execution in memory locations in a memory array in the outbox instead of the stack and lock the memory locations in which the context was saved.	05-14-2009
20090125706	Software Pipelining on a Network on Chip - A network on chip (‘NOC’) that includes integrated processor (‘IP’) blocks, routers, memory communications controllers, and network interface controllers, with each IP block adapted to a router through a memory communications controller and a network interface controller, where each memory communications controller controlling communications between an IP block and memory, and each network interface controller controlling inter-IP block communications through routers, the NOC also including a computer software application segmented into stages, each stage comprising a flexibly configurable module of computer program instructions identified by a stage ID with each stage executing on a thread of execution on an IP block.	05-14-2009
20090135739	Network On Chip With Partitions - A network on chip (‘NOC’) that includes integrated processor (‘IP’) blocks, routers, memory communications controllers, and network interface controllers, with each IP block adapted to a router through a memory communications controller and a network interface controller, where each memory communications controller controlling communications between an IP block and memory, and each network interface controller controlling inter-IP block communications through routers, with the network organized into partitions, each partition including at least one IP block, each partition assigned exclusive access to a separate physical memory address space and one or more applications executing on one or more of the partitions.	05-28-2009
20090138567	Network on chip with partitions - A design structure embodied in a machine readable medium is provided. Embodiments of the design structure include a network on chip (‘NOC’), the NOC comprising: integrated processor (‘IP’) blocks, routers, memory communications controllers, and network interface controller, each IP block adapted to a router through a memory communications controller and a network interface controller, each memory communications controller controlling communication between an IP block and memory, and each network interface controller controlling inter-IP block communications through routers; the network organized into partitions, each partition including at least one IP block, each partition assigned exclusive access to a separate physical memory address space; and one or more applications executing on one or more of the partitions.	05-28-2009
20090201302	Graphics Rendering On A Network On Chip - Graphics rendering on a network on chip (‘NOC’) including receiving, in the geometry processor, a representation of an object to be rendered; converting, by the geometry processor, the representation of the object to two dimensional primitives; sending, by the geometry processor, the primitives to the plurality of scan converters; converting, by the scan converters, the primitives to fragments, each fragment comprising one or more portions of a pixel; for each fragment: selecting, by the scan converter for the fragment in dependence upon sorting rules, a pixel processor to process the fragment; sending, by the scan converter to the pixel processor, the fragment; and processing, by the pixel processor, the fragment to produce pixels for an image.	08-13-2009
20090271172	Emulating A Computer Run Time Environment - Emulating a computer run time environment as a component of a dynamic binary translation loop that translates target executable code compiled for execution on a target computer to code executable on a host computer of a kind other than the target computer, the target executable code including function calls to functions to be translated. Embodiments of the present invention include: determining, upon encountering in the binary translation loop a function call to a function to be translated, that the function call is a call to a host library function in a host native library; hashing a target executable image of the function to be translated from the target executable code, thereby producing a hash value; and using the hash value as an index to retrieve from a thunk table a host native address of the host library function in the host native library.	10-29-2009
20090271597	Branch Prediction In A Computer Processor - Methods, apparatus, and products for branch prediction in a computer processor are disclosed that include: recording for a sequence of occurrences of a branch, in an algorithm in which the branch occurs more than once, each result of the branch, including maintaining a pointer to a location of a most recently recorded result; resetting the pointer to a location of the first recorded result upon completion of the algorithm; and predicting subsequent results of the branch, in subsequent occurrences of the branch, in dependence upon the recorded results.	10-29-2009
20090282139	Emulating A Computer Run Time Environment - Emulating a computer run time environment including: storing translated code in blocks of a translated code cache, each block of the translated code cache designated for storage of translated code for a separate one of the target executable processes, including identifying each block in dependence upon an identifier of the process for which the block is designated as storage; executing by the emulation environment a particular one of the target executable processes, using for target code translation the translated code in the block of the translated code cache designated as storage for the particular process; and upon encountering a context switch by the target operating system to execution of a new target executable process, changing from the block designated for the particular process to using for target code translation the translated code in the block of the translated code cache designated as storage for the new target executable process.	11-12-2009
20090282211	Network On Chip With Partitions - Data processing with a network on chip (‘NOC’) that includes integrated processor (‘IP’) blocks, routers, memory communications controllers, and network interface controller, including: organizing the network into partitions; assigning all IP blocks of a partition a partition identifier (‘partition ID’) that uniquely identifies for an IP block a particular partition in which the IP block is included; establishing one or more permissions tables associating partition IDs with sources and destinations of data communications on the NOC, each record in the permissions tables representing a restriction on data communications on the NOC; executing one or more applications on one or more of the partitions, including transmitting data communications messages among IP blocks and between IP blocks and memory, each data communications message including a partition ID of a sender of the data communications message; and controlling data communications among the partitions in dependence upon the permissions tables and the partition IDs.	11-12-2009
20090282214	Network On Chip With Low Latency, High Bandwidth Application Messaging Interconnects That Abstract Hardware Inter-Thread Data Communications Into An Architected State of A Processor - Data processing on a network on chip (‘NOC’) that includes integrated processor (‘IP’) blocks, each of a plurality of the IP blocks including at least one computer processor, each such computer processor implementing a plurality of hardware threads of execution; low latency, high bandwidth application messaging interconnects; memory communications controllers; network interface controllers; and routers; each of the IP blocks adapted to a router through a separate one of the low latency, high bandwidth application messaging interconnects, a separate one of the memory communications controllers, and a separate one of the network interface controllers; each application messaging interconnect abstracting into an architected state of each processor, for manipulation by computer programs executing on the processor, hardware inter-thread communications among the hardware threads of execution; each memory communications controller controlling communication between an IP block and memory; each network interface controller controlling inter-IP block communications through routers.	11-12-2009
20090282222	Dynamic Virtual Software Pipelining On A Network On Chip - A NOC for dynamic virtual software pipelining including IP blocks, routers, memory communications controllers, and network interface controllers, each IP block adapted to a router through a memory communications controller and a network interface controller, the NOC also including: a computer software application segmented into stages, each stage comprising a flexibly configurable module of computer program instructions identified by a stage ID, each stage assigned to a thread of execution on an IP block; and each stage executing on a thread of execution on an IP block, including a first stage executing on an IP block, producing output data and sending by the first stage the produced output data to a second stage, the output data including control information for the next stage and payload data; and the second stage consuming the produced output data in dependence upon the control information.	11-12-2009
20090282226	Context Switching On A Network On Chip - A network on chip (‘NOC’) that includes IP blocks, routers, memory communications controllers, and network interface controllers, each IP block adapted to the network by an application messaging interconnect including an inbox and an outbox, one or more of the IP blocks including computer processors supporting a plurality of threads, the NOC also including an inbox and outbox controller configured to set pointers to the inbox and outbox, respectively, that identify valid message data for a current thread; and software running in the current thread that, upon a context switch to a new thread, is configured to: save the pointer values for the current thread, and reset the pointer values to identify valid message data for the new thread, where the inbox and outbox controller are further configured to retain the valid message data for the current thread in the boxes until context switches again to the current thread.	11-12-2009
20090282227	Monitoring Software Pipeline Performance On A Network On Chip - Software pipelining on a network on chip (‘NOC’), the NOC including integrated processor (‘IP’) blocks, routers, memory communications controllers, and network interface controllers, each IP block adapted to a router through a memory communications controller and a network interface controller, each memory communications controller controlling communication between an IP block and memory, and each network interface controller controlling inter-IP block communications through routers. Embodiments of the present invention include implementing a software pipeline on the NOC, including segmenting a computer software application into stages, each stage comprising a flexibly configurable module of computer program instructions identified by a stage ID; executing each stage of the software pipeline on a thread of execution on an IP block; monitoring software pipeline performance in real time; and reconfiguring the software pipeline, dynamically, in real time, and in dependence upon the monitored software pipeline performance.	11-12-2009
20090282419	Ordered And Unordered Network-Addressed Message Control With Embedded DMA Commands For A Network On Chip - Data processing on a network on chip (‘NOC’) that includes integrated processor (‘IP’) blocks, routers, memory communications controllers, network interface controllers, and network-addressed message controllers, with each IP block adapted to a router through a memory communications controller, a network-addressed message controller, and a network interface controller, where each memory communications controller controlling communications between an IP block and memory, each network interface controller controlling inter-IP block communications through routers, with each IP block also adapted to the network by a low latency, high bandwidth application messaging interconnect comprising an inbox and an outbox.	11-12-2009
20100100707	DATA STRUCTURE FOR CONTROLLING AN ALGORITHM PERFORMED ON A UNIT OF WORK IN A HIGHLY THREADED NETWORK ON A CHIP - A computer-implemented method, system and computer program product for controlling an algorithm that is performed on a unit of work in a subsequent software pipeline stage in a Network On a Chip (NOC) is presented. In one embodiment, the method executes a first operation in a first node of the NOC. The first node generates payload, and then loads that payload into a message. The message with the payload is transmitted to a nanokernel that controls a second node in the NOC. The nanokernel calls an algorithm that is needed by a second operation in a second node in the NOC, which uses the algorithm to execute the second operation.	04-22-2010
20100100770	SOFTWARE DEBUGGER FOR PACKETS IN A NETWORK ON A CHIP - A breakpoint packet is dispatched to a Network On A Chip (NOC). The breakpoint packet instructs one or more specified nodes on the NOC to place the specified nodes, or a core or hardware thread within a specified node, to execute in “single step” mode, in order to enable a debugging of a work packet that is dispatched to the specific node.	04-22-2010
20100100934	SECURITY METHODOLOGY TO PREVENT USER FROM COMPROMISING THROUGHPUT IN A HIGHLY THREADED NETWORK ON A CHIP PROCESSOR - A computer-implemented method, system and computer program product for preventing an untrusted work unit message from compromising throughput in a highly threaded Network On a Chip (NOC) processor are presented. A security message, which is associated with the untrusted work unit message, directs other resources within the NOC to operate in a secure mode while a specified node, within the NOC, executes instructions from the work unit message in a less privileged non-secure mode. Thus, throughput within the NOC is uncompromised due to resources, other than the first node, being protected from the untrusted work unit message.	04-22-2010
20100188402	User-Defined Non-Visible Geometry Featuring Ray Filtering - A method, system and computer program product for managing secondary rays during ray-tracing are presented. A non-visible unidirectional ray tracing object logically surrounds a user-selected virtual object in a computer generated illustration. This unidirectional ray tracing object prevents secondary tracing rays from emanating from the user-selected virtual object during ray tracing.	07-29-2010
20100191940	SINGLE STEP MODE IN A SOFTWARE PIPELINE WITHIN A HIGHLY THREADED NETWORK ON A CHIP MICROPROCESSOR - A hardware thread is selectively forced to single step the execution of software instructions from a work packet granule. A “single step” packet is associated with a work packet granule. The work packet granule, with the associated “single step” packet, is dispatched as an appended work packet granule to a preselected hardware thread in a processor core, which, in one embodiment, is located at a node in a Network On a Chip (NOC). The work packet granule then executes in a single step mode until completion.	07-29-2010
20100192014	PSEUDO RANDOM PROCESS STATE REGISTER FOR FAST RANDOM PROCESS TEST GENERATION - A method, system and computer program product are presented for providing pseudo-random input test data to a test program. A seed value is generated and stored in a seed register. Using the seed value as an input, a pseudo-random input test value is generated by a Linear Feedback Shift Register (LFSR), and stored in a GPR within a processor core. Using the pseudo-random input test value from the GPR, a test program is executed within the processor core.	07-29-2010
20100199067	Split Vector Loads and Stores with Stride Separated Words - A method, system and computer program product are presented for causing a parallel load/store of stride-separated words from a data vector using different memory chips in a computer.	08-05-2010
20100238169	Physical Rendering With Textured Bounding Volume Primitive Mapping - A circuit arrangement, program product and circuit arrangement utilize a textured bounding volume to reduce the overhead associated with generating and using an Accelerated Data Structure (ADS) in connection with physical rendering. In particular, a subset of the primitives in a scene may be mapped to surfaces of a bounding volume to generate textures on such surfaces that can be used during physical rendering. By doing so, the primitives that are mapped to the bounding volume surfaces may be omitted from the ADS to reduce the processing overhead associated with both generating the ADS and using the ADS during physical rendering, and furthermore, in many instances the size of the ADS may be reduced, thus reducing the memory footprint of the ADS, and often improving cache hit rates and reducing memory bandwidth.	09-23-2010
20100239185	Accelerated Data Structure Optimization Based Upon View Orientation - A circuit arrangement, program product and method utilize the known view orientation for an image frame to be rendered to optimize the generation and/or use of an Accelerated Data Structure (ADS) used in physical rendering-based image processing. In particular, it has been found that while geometry primitives that are not within a view orientation generally cannot be culled from a scene when a physical rendering technique such as ray tracing is performed, those primitives nonetheless have a smaller impact on the resulting image frame, and as a result, less processing resources can be applied to such primitives, leaving greater processing resources available for processing those primitives that are located within the view orientation, and thereby improving overall rendering performance.	09-23-2010
20100239186	Accelerated Data Structure Positioning Based Upon View Orientation - A circuit arrangement, program product and circuit arrangement utilize the known view orientation for an image frame to be rendered to reposition an Accelerated Data Structure (ADS) used during rendering to optimize the generation and/or use of the ADS, e.g., by transforming a scene from which an image frame is rendered to orient the scene relative to the view orientation prior to generating the ADS. A scene may be transformed, for example, to orient the view orientation within a single octant of the scene, with additional processing resources assigned to that octant to ensure sufficient processing resources are devoted to processing the primitives within the view orientation.	09-23-2010
20100269123	Performance Event Triggering Through Direct Interthread Communication On a Network On Chip - Performance event triggering through direct interthread communication (‘DITC’) on a network on chip (‘NOC’), the NOC including integrated processor (‘IP’) blocks, routers, memory communications controllers, and network interface controllers, with each IP block adapted to a router through a memory communications controller and a network interface controller, where each memory communications controller controlling communications between an IP block and memory, and each network interface controller controlling inter-IP block communications through routers, including enabling performance event monitoring in a selected set of IP blocks distributed throughout the NOC, each IP block within the selected set of IP blocks having one or more event counters; collecting performance results from the one or more event counters; and returning performance results from the one or more event counters to a destination repository, the returning being initiated by a triggering event occurring within the NOC.	10-21-2010
20110285709	Allocating Resources Based On A Performance Statistic - A method includes rendering an object of a three dimensional image via a pixel shader based on a render context data structure associated with the object. The method includes measuring a performance statistic associated with rendering the object. The method also includes storing the performance statistic in the render context data structure associated with the object. The performance statistic is accessible to a host interface processor to determine whether to allocate a second pixel shader to render the object in a subsequent three-dimensional image.	11-24-2011
20110285710	Parallelized Ray Tracing - A method includes assigning a priority to a ray data structure of a plurality of ray data structures based on one or more priorities. The ray data structure includes properties of a ray to be traced from an illumination source in a three-dimensional image. The method includes identifying a portion of the three-dimensional image through which the ray passes. The method also includes identifying a slave processing element associated with the portion of the three-dimensional image. The method further includes sending the ray data structure to the slave processing element.	11-24-2011
20110289485	Software Trace Collection and Analysis Utilizing Direct Interthread Communication On A Network On Chip - Collecting and analyzing trace data while in a software debug mode through direct interthread communication (‘DITC’) on a network on chip (‘NOC’), the NOC including integrated processor (‘IP’) blocks, routers, memory communications controllers, and network interface controllers, with each IP block adapted to a router through a memory communications controller and a network interface controller, where each memory communications controller controlling communications between an IP block and memory, and each network interface controller controlling inter-IP block communications through routers, including enabling the collection of software debug information in a selected set of IP blocks distributed through the NOC, each IP block within the selected set of IP blocks having a set of trace data; collecting software debugging information via the set of trace data; communicating the set of trace data to a destination repository; and analyzing the set of trace data at the destination repository.	11-24-2011
20110316855	Parallelized Streaming Accelerated Data Structure Generation - A method includes receiving at a master processing element primitive data that includes properties of a primitive. The method includes partially traversing a spatial data structure that represents a three-dimensional image to identify an internal node of the spatial data structure. The internal node represents a portion of the three-dimensional image. The method also includes selecting a slave processing element from a plurality of slave processing elements. The selected processing element is associated with the internal node. The method further includes sending the primitive data to the selected slave processing element to traverse a portion of the spatial data structure to identify a leaf node of the spatial data structure.	12-29-2011
20110316864	MULTITHREADED SOFTWARE RENDERING PIPELINE WITH DYNAMIC PERFORMANCE-BASED REALLOCATION OF RASTER THREADS - A multithreaded rendering software pipeline architecture dynamically reallocates regions of an image space to raster threads based upon performance data collected by the raster threads. The reallocation of the regions typically includes resizing the regions assigned to particular raster threads and/or reassigning regions to different raster threads to better balance the relative workloads of the raster threads.	12-29-2011
20110317712	Recovering Data From A Plurality of Packets - A method includes receiving a plurality of packets at an integrated processor block of a network on a chip device. The plurality of packets includes a first packet that includes an indication of a start of data associated with a pixel shader application. The method includes recovering the data from the plurality of packets. The method also includes storing the recovered data in a dedicated packet collection memory within the network on the chip device. The method further includes retaining the data stored in the dedicated packet collection memory during an interruption event. Upon completion of the interruption event, the method includes copying packets stored in the dedicated packet collection memory prior to the interruption event to an inbox of the network on the chip device for processing.	12-29-2011
20110320719	PROPAGATING SHARED STATE CHANGES TO MULTIPLE THREADS WITHIN A MULTITHREADED PROCESSING ENVIRONMENT - A circuit arrangement and method make state changes to shared state data in a highly multithreaded environment by propagating or streaming the changes to multiple parallel hardware threads of execution in the multithreaded environment using an on-chip communications network and without attempting to access any copy of the shared state data in a shared memory to which the parallel threads of execution are also coupled. Through the use of an on-chip communications network, changes to the shared state data may be communicated quickly and efficiently to multiple threads of execution, enabling those threads to locally update their local copies of the shared state. Furthermore, by avoiding attempts to access a shared memory, the interface to the shared memory is not overloaded with concurrent access attempts, thus preserving memory bandwidth for other activities and reducing memory latency. Particularly for larger shared states, propagating the changes, rather than an entire shared state, further improves performance by reducing the amount of data communicated over the on-chip communications network.	12-29-2011
20110320724	DMA-BASED ACCELERATION OF COMMAND PUSH BUFFER BETWEEN HOST AND TARGET DEVICES - Direct Memory Access (DMA) is used in connection with passing commands between a host device and a target device coupled via a push buffer. Commands passed to a push buffer by a host device may be accumulated by the host device prior to forwarding the commands to the push buffer, such that DMA may be used to collectively pass a block of commands to the push buffer. In addition, a host device may utilize DMA to pass command parameters for commands to a command buffer that is accessible by the target device but is separate from the push buffer, with the commands that are passed to the push buffer including pointers to the associated command parameters in the command buffer.	12-29-2011
20110320771	INSTRUCTION UNIT WITH INSTRUCTION BUFFER PIPELINE BYPASS - A circuit arrangement and method selectively bypass an instruction buffer for selected instructions so that bypassed instructions can be dispatched without having to first pass through the instruction buffer. Thus, for example, in the case that an instruction buffer is partially or completely flushed as a result of an instruction redirect (e.g., due to a branch mispredict), instructions can be forwarded to subsequent stages in an instruction unit and/or to one or more execution units without the latency associated with passing through the instruction buffer.	12-29-2011
20110321049	Programmable Integrated Processor Blocks - An integrated processor block of the network on a chip is programmable to perform a first function. The integrated processor block includes an inbox to receive incoming packets from other integrated processor blocks of a network on a chip, an outbox to send outgoing packets to the other integrated processor blocks, an on-chip memory, and a memory management unit to enable access to the on-chip memory.	12-29-2011
20120176364	REUSE OF STATIC IMAGE DATA FROM PRIOR IMAGE FRAMES TO REDUCE RASTERIZATION REQUIREMENTS - An apparatus, program product and method reuse static image data generated during rasterization of static geometry to reduce the processing overhead associated with rasterizing subsequent image frames. In particular, static image data generated one frame may be reused in a subsequent image frame such that the subsequent image frame is generated without having to re-rasterize the static geometry from the scene, i.e., with only the dynamic geometry rasterized. The resulting image frame includes dynamic image data generated as a result of rasterizing the dynamic geometry during that image frame, and static image data generated as a result of rasterizing the static image data during a prior image frame.	07-12-2012
20120179899	UPGRADEABLE PROCESSOR ENABLING HARDWARE LICENSING - A technique for programming a configurable co-processor in a processing system is disclosed. The configurable co-processor includes field programmable logic and is configured using a pre-generated co-processor image. The technique involves enabling a user application to program the configurable co-processor with certain application-specific hardware based processing functions. One advantage of the present invention is that application-specific hardware design optimizations may be implemented to improve application performance after hardware for the processing system has been manufactured.	07-12-2012
20120192202	Context Switching On A Network On Chip - A network on chip (NOC) that includes IP blocks, routers, memory communications controllers, and network interface controllers, each IP block adapted to the network by an application messaging interconnect including an inbox and an outbox, one or more of the IP blocks including computer processors supporting a plurality of threads, the NOC also including an inbox and outbox controller configured to set pointers to the inbox and outbox, respectively, that identify valid message data for a current thread; and software running in the current thread that, upon a context switch to a new thread, is configured to: save the pointer values for the current thread, and reset the pointer values to identify valid message data for the new thread, where the inbox and outbox controller are further configured to retain the valid message data for the current thread in the boxes until context switches again to the current thread.	07-26-2012
20120209944	Software Pipelining On A Network On Chip - Memory sharing in a software pipeline on a network on chip (‘NOC’), the NOC including integrated processor (‘IP’) blocks, routers, memory communications controllers, and network interface controllers, with each IP block adapted to a router through a memory communications controller and a network interface controller, where each memory communications controller controlling communications between an IP block and memory, and each network interface controller controlling inter-IP block communications through routers, including segmenting a computer software application into stages of a software pipeline, the software pipeline comprising one or more paths of execution; allocating memory to be shared among at least two stages including creating a smart pointer, the smart pointer including data elements for determining when the shared memory can be deallocated; determining, in dependence upon the data elements for determining when the shared memory can be deallocated, that the shared memory can be deallocated; and deallocating the shared memory.	08-16-2012
20120221711	REGULAR EXPRESSION SEARCHES UTILIZING GENERAL PURPOSE PROCESSORS ON A NETWORK INTERCONNECT - A first hardware node in a network interconnect receives a data packet from a network. The first hardware node examines the data packet for a regular expression. In response to the first hardware node failing to identify the regular expression in the data packet, the data packet is forwarded to a second hardware node in the network interconnect for further examination of the data packet in order to search for the regular expression in the data packet.	08-30-2012
20120260252	SCHEDULING SOFTWARE THREAD EXECUTION - A computer-implemented method, system, and/or computer program product schedules execution of software threads. A first software thread is executed together with a second software thread as a first software thread pair. A first content, which resulted from executing the first software pair together, of at least one performance counter, is stored. The first software thread is then executed with a third software thread as a second software thread pair, and the resulting second content of the performance counter(s) is stored. An identification is made of a most efficient software thread pair from the first and second software thread pairs. Upon receiving a request to re-execute the first software thread, the first software thread is selectively matched with either the second software thread or the third software thread for execution based on whether the first software thread pair or the second software thread pair has been identified as the most efficient software thread pair.	10-11-2012
20130044117	VECTOR REGISTER FILE CACHING OF CONTEXT DATA STRUCTURE FOR MAINTAINING STATE DATA IN A MULTITHREADED IMAGE PROCESSING PIPELINE - Frequently accessed state data used in a multithreaded graphics processing architecture is cached within a vector register file of a processing unit to optimize accesses to the state data and minimize memory bus utilization associated therewith. A processing unit may include a fixed point execution unit as well as a vector floating point execution unit, and a vector register file utilized by the vector floating point execution unit may be used to cache state data used by the fixed point execution unit and transferred as needed into the general purpose registers accessible by the fixed point execution unit, thereby reducing the need to repeatedly retrieve and write back the state data from and to an L1 or lower level cache accessed by the fixed point execution unit.	02-21-2013
20130046518	MULTITHREADED PHYSICS ENGINE WITH IMPULSE PROPAGATION - A circuit arrangement and method implement impulse propagation in a multithreaded physics engine by assigning ownership of objects in a scene to individual threads and propagating impulses between objects that are in contact with one another by passing inter-thread impulse messages between the threads that own the contacting objects, while locally propagating impulses through objects using the threads to which such objects are assigned.	02-21-2013
20130111190	OPERATIONAL CODE EXPANSION IN RESPONSE TO SUCCESSIVE TARGET ADDRESS DETECTION	05-02-2013
20130138918	DIRECT INTERTHREAD COMMUNICATION DATAPORT PACK/UNPACK AND LOAD/SAVE - A circuit arrangement, method, and program product for compressing and decompressing data in a node of a system including a plurality of nodes interconnected via an on-chip network. Compressed data may be received and stored at an input buffer of a node, and in parallel with moving the compressed data to an execution register of the node, decompression logic of the node may decompress the data to generate uncompressed data, such that uncompressed data is stored in the execution register for utilization by an execution unit of the node. Uncompressed data may be output by the execution unit into the execution register, and in parallel with moving the uncompressed data to an output buffer of the node connected to the on-chip network, compression logic may compress the uncompressed data to generate compressed data, such that compressed data is stored at the output buffer.	05-30-2013
20130145128	PROCESSING CORE WITH PROGRAMMABLE MICROCODE UNIT - A method and circuit arrangement utilize a programmable microcode unit that is capable of being programmed via software to modify the instruction sequences output by the microcode unit in response to microcode instructions issued to the microcode unit. Among other benefits, a programmable microcode unit consistent with the invention enables customization of a processor design to handle specific applications or tasks, as well as to support specific hardware configurations such as specific execution units. In addition, a programmable microcode unit may be updatable, e.g., to correct bugs or faults found in previous instruction sequences supported by the unit.	06-06-2013
20130159668	PREDECODE LOGIC FOR AUTOVECTORIZING SCALAR INSTRUCTIONS IN AN INSTRUCTION BUFFER - A circuit arrangement, method, and program product for substituting a plurality of scalar instructions in an instruction stream with a functionally equivalent vector instruction for execution by a vector execution unit. Predecode logic is coupled to an instruction buffer which stores instructions in an instruction stream to be executed by the vector execution unit. The predecode logic analyzes the instructions passing through the instruction buffer to identify a plurality of scalar instructions that may be replaced by a vector instruction in the instruction stream. The predecode logic may generate the functionally equivalent vector instruction based on the plurality of scalar instructions, and the functionally equivalent vector instruction may be substituted into the instruction stream, such that the vector execution unit executes the vector instruction in lieu of the plurality of scalar instructions.	06-20-2013
20130159674	INSTRUCTION PREDICATION USING INSTRUCTION FILTERING - A method and circuit arrangement for selectively predicating instructions in an instruction stream based upon a predication filter criteria defined by a predication filter, which describes types or patterns of instructions that should be predicated. Predication logic compares a respective instruction of an instruction stream to predication filter criteria to determine whether the respective instruction matches the predication filter criteria, and the respective instruction is selectively predicated based on whether the respective instruction matches the predication filter criteria.	06-20-2013
20130159675	INSTRUCTION PREDICATION USING UNUSED DATAPATH FACILITIES - A method and circuit arrangement for selectively predicating an instruction in an instruction stream based upon a value corresponding to a predication register address indicated by a portion of an operand associated with the instruction. A first compare instruction in an instruction stream stores a compare result in at a register address of a predication register. The register address of the predication register is stored in a portion of an operand associated with a second instruction, and during decoding the second instruction, the predication register is accessed to determine a value stored at the register address of the predication register, and the second instruction is selectively predicated based on the value stored at the register address of the predication register.	06-20-2013
20130159676	INSTRUCTION SET ARCHITECTURE WITH EXTENDED REGISTER ADDRESSING - A method and circuit arrangement selectively repurpose bits from a primary opcode portion of an instruction for use in decoding one or more operands for the instruction. Decode logic of a processor, for example, may be placed in a predetermined mode that decodes a primary opcode for an instruction that is different from that specified in the primary opcode portion of the instruction, and then utilize one or more bits in the primary opcode portion to decode one or more operands for the instruction. By doing so, additional space is freed up in the instruction to support a larger register file and/or additional instruction types, e.g., as specified by a secondary or extended opcode.	06-20-2013
20130160026	INDIRECT INTER-THREAD COMMUNICATION USING A SHARED POOL OF INBOXES - A circuit arrangement, method, and program product for communicating data between hardware threads of a network on a chip processing unit utilizes shared inboxes to communicate data to pools of hardware threads. The associated hardware in the pools threads receive data packets from the shared inboxes in response to issuing work requests to an associated shared inbox. Data packets include a source identifier corresponding to a hardware thread from which the data packet was generated, and the shared inboxes may manage data packet distribution to associated hardware threads based on the source identifier of each data packet. A shared inbox may also manage workload distribution and uneven workload lengths by communicating data packets to hardware threads associated with the shared inbox in response to receiving work requests from associated hardware threads.	06-20-2013
20130160114	INTER-THREAD COMMUNICATION WITH SOFTWARE SECURITY - A circuit arrangement and method utilize a process context translation data structure in connection with an on-chip network of a processor chip to implement secure inter-thread communication between hardware threads in the processor chip. The process context translation data structure maps processes to inter-thread communication hardware resources, e.g., the inbox and/or outbox buffers of a NOC processor, such that a user process is only allowed to access the inter-thread communication hardware resources that it has been granted access to, and typically with only certain types of authorized access types. Moreover, a hypervisor or supervisor may manage the process context translation data structure to grant or deny access rights to user processes such that, once those rights are established in the data structure, user processes are permitted to perform inter-thread communications without requiring context switches to a hypervisor or supervisor in order to handle the communications.	06-20-2013
20130185542	EXTERNAL AUXILIARY EXECUTION UNIT INTERFACE TO OFF-CHIP AUXILIARY EXECUTION UNIT - An external Auxiliary Execution Unit (AXU) interface is provided between a processing core disposed in a first programmable chip and an off-chip AXU disposed in a second programmable chip to integrate the AXU with an issue unit, a fixed point execution unit, and optionally other functional units in the processing core. The external AXU interface enables the issue unit to issue instructions to the AXU in much the same manner as the issue unit would be able to issue instructions to an AXU that was disposed on the same chip. By doing so, the AXU on the second programmable chip can be designed, tested and verified independent of the processing core on the first programmable chip, thereby enabling a common processing core, which has been designed, tested, and verified, to be used in connection with multiple different AXU designs.	07-18-2013
20130185704	PROVIDING PERFORMANCE TUNED VERSIONS OF COMPILED CODE TO A CPU IN A SYSTEM OF HETEROGENEOUS CORES - A compiler may optimize source code and any referenced libraries to execute on a plurality of different processor architecture implementations. For example, if a compute node has three different types of processors with three different architecture implementations, the compiler may compile the source code and generate three versions of object code where each version is optimized for one of the three different processor types. After compiling the source code, the resultant executable code may contain the necessary information for selecting between the three versions. For example, when a program loader assigns the executable code to the processor, the system determines the processor's type and ensures only the optimized version that corresponds to that type is executed. Thus, the operating system is free to assign the executable code to any processor based on, for example, the current status of the processor (i.e., whether its CPU is being fully utilized) and still enjoy the benefits of executing code that is optimized for whichever processor is assigned the executable code.	07-18-2013
20130185705	PROVIDING PERFORMANCE TUNED VERSIONS OF COMPILED CODE TO A CPU IN A SYSTEM OF HETEROGENEOUS CORES - A compiler may optimize source code and any referenced libraries to execute on a plurality of different processor architecture implementations. For example, if a compute node has three different types of processors with three different architecture implementations, the compiler may compile the source code and generate three versions of object code where each version is optimized for one of the three different processor types. After compiling the source code, the resultant executable code may contain the necessary information for selecting between the three versions. For example, when a program loader assigns the executable code to the processor, the system determines the processor's type and ensures only the optimized version that corresponds to that type is executed. Thus, the operating system is free to assign the executable code to any of the different types of processors.	07-18-2013
20130191600	COMBINED CACHE INJECT AND LOCK OPERATION - A circuit arrangement and method utilize cache injection logic to perform a cache inject and lock operation to inject a cache line in a cache memory and automatically lock the cache line in the cache memory in parallel with communication of the cache line to a main memory. The cache injection logic may additionally limit the maximum number of locked cache lines that may be stored in the cache memory, e.g., by aborting a cache inject and lock operation, injecting the cache line without locking, or unlocking and/or evicting another cache line in the cache memory.	07-25-2013
20130191649	MEMORY ADDRESS TRANSLATION-BASED DATA ENCRYPTION/COMPRESSION - A method and circuit arrangement selectively stream data to an encryption or compression engine based upon encryption and/or compression-related page attributes stored in a memory address translation data structure such as an Effective To Real Translation (ERAT) or Translation Lookaside Buffer (TLB). A memory address translation data structure may be accessed, for example, in connection with a memory access request for data in a memory page, such that attributes associated with the memory page in the data structure may be used to control whether data is encrypted/decrypted and/or compressed/decompressed in association with handling the memory access request.	07-25-2013
20130191651	MEMORY ADDRESS TRANSLATION-BASED DATA ENCRYPTION WITH INTEGRATED ENCRYPTION ENGINE - A method and circuit arrangement utilize an integrated encryption engine within a processing core of a multi-core processor to perform encryption operations, i.e., encryption and decryption of secure data, in connection with memory access requests that access such data. The integrated encryption engine is utilized in combination with a memory address translation data structure such as an Effective To Real Translation (ERAT) or Translation Lookaside Buffer (TLB) that is augmented with encryption-related page attributes to indicate whether pages of memory identified in the data structure are encrypted such that secure data associated with a memory access request in the processing core may be selectively streamed to the integrated encryption engine based upon the encryption-related page attribute for the memory page associated with the memory access request.	07-25-2013
20130191824	VIRTUALIZATION SUPPORT FOR BRANCH PREDICTION LOGIC ENABLE/DISABLE - A hypervisor and one or more guest operating systems resident in a data processing system and hosted by the hypervisor are configured to selectively enable or disable branch prediction logic through separate hypervisor-mode and guest-mode instructions. By doing so, different branch prediction strategies may be employed for different operating systems and user applications hosted thereby to provide finer grained optimization of the branch prediction logic for different operating scenarios.	07-25-2013
20130191825	VIRTUALIZATION SUPPORT FOR SAVING AND RESTORING BRANCH PREDICTION LOGIC STATES - A hypervisor and one or more programs, e.g., guest operating systems and/or user processes or applications hosted by the hypervisor to configured to selectively save and restore the state of branch prediction logic through separate hypervisor-mode and guest-mode and/or user-mode instructions. By doing so, different branch prediction strategies may be employed for different operating systems and user applications hosted thereby to provide finer grained optimization of the branch prediction logic.	07-25-2013
20140143557	DISTRIBUTED CHIP LEVEL POWER SYSTEM - A method, circuit arrangement, and program product for dynamically reallocating power consumption at a component level of a processor. Power tokens representative of a power consumption metric are allocated to interconnected IP blocks of the processor, and as additional power is required by an IP block to perform assigned operations, the IP block may communicate a request for additional power tokens to one or more interconnected IP blocks. The interconnected IP blocks may grant power tokens for the request based on a priority, availability, and/or power consumption target. The requesting IP block may modify power consumption based on power tokens granted by interconnected IP blocks for the request.	05-22-2014
20140143558	DISTRIBUTED CHIP LEVEL MANAGED POWER SYSTEM - A method, circuit arrangement, and program product for dynamically reallocating power consumption at a component level of a processor. Power tokens representative of a power consumption metric are allocated to interconnected IP blocks of the processor, and as additional power is required by an IP block to perform assigned operations, the IP block may communicate a request for additional power tokens to one or more interconnected IP blocks. The interconnected IP blocks may grant power tokens for the request based on a priority, availability, and/or power consumption target. The requesting IP block may modify power consumption based on power tokens granted by interconnected IP blocks for the request. A power management block may adjust power token allocation of one or more IP blocks by communicating a command to one or more IP blocks and/or by adjusting a power token request.	05-22-2014
20140149720	FLOATING POINT EXECUTION UNIT FOR CALCULATING PACKED SUM OF ABSOLUTE DIFFERENCES - A method and circuit arrangement provide support for packed sum of absolute difference operations in a floating point execution unit, e.g., a scalar or vector floating point execution unit. Existing adders in a floating point execution unit may be utilized along with minimal additional logic in the floating point execution unit to support efficient execution of a fixed point packed sum of absolute differences instruction within the floating point execution unit, often eliminating the need for a separate vector fixed point execution unit in a processor architecture, and thereby leading to less logic and circuit area, lower power consumption and lower cost.	05-29-2014
20140164464	VECTOR EXECUTION UNIT WITH PRENORMALIZATION OF DENORMAL VALUES - A method, circuit arrangement, and program product for executing instructions including denormal values for one or more operands in a vector execution unit. A denormal value operand may be prenormalized by a first processing lane of the vector execution unit upon detecting the denormal value. The prenormalized value and any other operands of the instruction may be communicated to a dot product adder of the vector execution unit. The dot product adder performs at least a portion of the floating point operation with the prenormalized value and any other operands of the instruction.	06-12-2014
20140164465	VECTOR EXECUTION UNIT WITH PRENORMALIZATION OF DENORMAL VALUES - A method, circuit arrangement, and program product for executing instructions including denormal values for one or more operands in a vector execution unit. A denormal value operand may be prenormalized by a first processing lane of the vector execution unit upon detecting the denormal value. The prenormalized value and any other operands of the instruction may be communicated to a dot product adder of the vector execution unit. The dot product adder performs at least a portion of the floating point operation with the prenormalized value and any other operands of the instruction.	06-12-2014
20140164703	CACHE SWIZZLE WITH INLINE TRANSPOSITION - A method and circuit arrangement selectively swizzle data in one or more levels of cache memory coupled to a processing unit based upon one or more swizzle-related page attributes stored in a memory address translation data structure such as an Effective To Real Translation (ERAT) or Translation Lookaside Buffer (TLB). A memory address translation data structure may be accessed, for example, in connection with a memory access request for data in a memory page, such that attributes associated with the memory page in the data structure may be used to control whether data is swizzled, and if so, how the data is to be formatted in association with handling the memory access request.	06-12-2014
20140164704	CACHE SWIZZLE WITH INLINE TRANSPOSITION - A method and circuit arrangement selectively swizzle data in one or more levels of cache memory coupled to a processing unit based upon one or more swizzle-related page attributes stored in a memory address translation data structure such as an Effective To Real Translation (ERAT) or Translation Lookaside Buffer (TLB). A memory address translation data structure may be accessed, for example, in connection with a memory access request for data in a memory page, such that attributes associated with the memory page in the data structure may be used to control whether data is swizzled, and if so, how the data is to be formatted in association with handling the memory access request.	06-12-2014
20140164731	TRANSLATION MANAGEMENT INSTRUCTIONS FOR UPDATING ADDRESS TRANSLATION DATA STRUCTURES IN REMOTE PROCESSING NODES - Translation management instructions are used in a multi-node data processing system to facilitate remote management of address translation data structures distributed throughout such a system. Thus, in multi-node data processing systems where multiple processing nodes collectively handle a workload, the address translation data structures for such nodes may be collectively managed to minimize translation misses and the performance penalties typically associated therewith.	06-12-2014
20140164732	TRANSLATION MANAGEMENT INSTRUCTIONS FOR UPDATING ADDRESS TRANSLATION DATA STRUCTURES IN REMOTE PROCESSING NODES - Translation management instructions are used in a multi-node data processing system to facilitate remote management of address translation data structures distributed throughout such a system. Thus, in multi-node data processing systems where multiple processing nodes collectively handle a workload, the address translation data structures for such nodes may be collectively managed to minimize translation misses and the performance penalties typically associated therewith.	06-12-2014
20140164734	CONCURRENT MULTIPLE INSTRUCTION ISSUE OF NON-PIPELINED INSTRUCTIONS USING NON-PIPELINED OPERATION RESOURCES IN ANOTHER PROCESSING CORE - A method and circuit arrangement utilize inactive non-pipelined operation resources in one processing core of a multi-core processing unit to execute non-pipelined instructions on behalf of another processing core in the same processing unit. Adjacent processing cores in a processing unit may be coupled together such that, for example, when one processing core's non-pipelined execution sequencer is busy, that processing core may issue into another processing core's non-pipelined execution sequencer if that other processing core's non-pipelined execution sequencer is idle, thereby providing intermittent concurrent execution of multiple non-pipelined instructions within each individual processing core.	06-12-2014
20140173296	CHIP LEVEL POWER REDUCTION USING ENCODED COMMUNICATIONS - A circuit arrangement, method, and program product communicate data over a communication bus by selectively encoding data values queued for communication over the communication bus based at least in part on at least one data value queued to be communicated thereafter and at least one previously communicated encoded data value to reduce bit transitions for communication of the encoded data values. By reducing bit transitions in the data communicated over the communication bus, power consumption by the communication bus is likewise reduced.	06-19-2014
20140173308	CHIP LEVEL POWER REDUCTION USING ENCODED COMMUNICATIONS - A circuit arrangement, method, and program product communicate data over a communication bus by selectively encoding data values queued for communication over the communication bus based at least in part on at least one data value queued to be communicated thereafter and at least one previously communicated encoded data value to reduce bit transitions for communication of the encoded data values. By reducing bit transitions in the data communicated over the communication bus, power consumption by the communication bus is likewise reduced.	06-19-2014
20150020078	THREAD SCHEDULING ACROSS HETEROGENEOUS PROCESSING ELEMENTS WITH RESOURCE MAPPING - A system, method, and program product for scheduling processes of a workload on a plurality of hardware threads configured in a plurality of processing elements of a multithreading parallel computing system for processing thereby. Process dimensions for each process are determined based on processing attributes associated with each process, and a place and route algorithm is utilized to map the processes to a processor space representative of the processing resources of the computing system based at least in part on the process dimensions to thereby distribute the processes of the workload.	01-15-2015
20150026435	INSTRUCTION SET ARCHITECTURE WITH EXTENSIBLE REGISTER ADDRESSING - A method and circuit arrangement selectively source and/or write data from/to extended registers of an extended register file based in part on whether an operand address of an instruction references a primary register of primary register file configured to store a pointer to the extended register. Control logic connected to the primary register file and the extended register file determines whether the operand address references a primary register configured to store a pointer, and responsive to the determination, the control logic causes execution logic to selectively source and/or write data from/to the extended register pointed to by the pointer stored in the referenced primary register.	01-22-2015
20150026500	GENERAL PURPOSE PROCESSING UNIT WITH LOW POWER DIGITAL SIGNAL PROCESSING (DSP) MODE - A method and circuit arrangement utilize a general purpose processing unit having a low power DSP mode for reconfiguring the general purpose processing unit to efficiently execute DSP workloads with reduced power consumption. When in a DSP mode, one or more of a data cache, an execution unit, and simultaneous multithreading may be disabled to reduce power consumption and improve performance for DSP workloads. Furthermore, partitioning of a register file to support multithreading, and register renaming functionality, may be disabled to provide an expanded set of registers for use with DSP workloads. As a result, a general purpose processing unit may be provided with enhanced performance for DSP workloads with reduced power consumption, while also not sacrificing performance for other non-DSP/general purpose workloads.	01-22-2015
20150032988	REGULAR EXPRESSION MEMORY REGION WITH INTEGRATED REGULAR EXPRESSION ENGINE - A method and circuit arrangement selectively perform regular expression matching in connection with accessing data with a processing unit based upon one or more regular expression matching-related attributes stored in a memory address translation data structure such as an Effective To Real Translation (ERAT) or Translation Lookaside Buffer (TLB). A regular expression matching-related attribute in such a data structure may be used to control whether data being communicated between the processing unit and a communications bus is routed through an expression engine integrated with the processing unit such that regular expression matching may be performed in association with the data communication.	01-29-2015
20150032999	INSTRUCTION SET ARCHITECTURE WITH OPCODE LOOKUP USING MEMORY ATTRIBUTE - A method and circuit arrangement decode instructions based in part on one or more decode-related attributes stored in a memory address translation data structure such as an Effective To Real Translation (ERAT) or Translation Lookaside Buffer (TLB). A memory address translation data structure may be accessed, for example, in connection with a decode of an instruction stored in a page of memory, such that one or more attributes associated with the page in the data structure may be used to control how that instruction is decoded.	01-29-2015

Patent applications by Paul E. Schardt, Rochester, MN US

Inventors list

Assignees list

Classification tree browser

Top 100 Inventors

Top 100 Assignees

Paul E. Schardt, Rochester US

Paul E. Schardt, Rochester, MN US