Patent application number | Description | Published |
20090049353 | SCHEME TO OPTIMIZE SCAN CHAIN ORDERING IN DESIGNS - A method for optimizing a scan chain ordering in circuit designs in an electronic computer-aided design system is provided. The method comprises: creating a schematic representative of a circuit design having a first cell and a second cell, the first cell and the second cell each having latches therein; creating a scan input pin and a scan output pin for each of the latches in the first cell and the second cell on the schematic; generating a first label on the schematic to provide a first wiring arrangement for the latches in the circuit design, the first wiring arrangement identifies a first order in which the scan input of each of the latches is wired to the scan output of another one of the latches; creating a layout representative of the circuit design; generating a first scan chain having a first length on the layout based on the first wiring arrangement; creating a second scan chain from the first scan chain on the layout, the second scan chain having a second length less than the first length of the first scan chain; and generating a second label on the schematic based on the second scan chain, the second label provides a second wiring arrangement for the latches in the circuit design, the second wiring arrangement identifies a second order in which the scan input of each of the latches is wired to the scan output of another one of the latches. | 02-19-2009 |
20090177870 | Method and System for a Wiring-Efficient Permute Unit - A method of providing wiring efficiency in a permute unit. Multiple selectors receive input data and shared control signals from multiple register files. The permute unit includes multiple multiplexors (MUXs) coupled to multiple logical AND gates. The multiple logical AND gates are coupled to multiple logical OR gates. The logical AND gates are physically separated from the logical OR gates. The logical AND gates receive input from one or more output data signals from the selectors. The logical OR gates combine the one or more output signals from the logical AND gates and provide output data from the permute unit. | 07-09-2009 |
20100095086 | Dynamically Aligning Enhanced Precision Vectors Based on Addresses Corresponding to Reduced Precision Vectors - Mechanisms for aligning enhanced precision vectors based on reduced precision data values are provided. At least one data value, having a first precision type, is received for storing in a vector register. The vector register stores data as a vector having a plurality of vector elements. The first precision type is modified to have a second precision type different in precision than the first precision type to thereby generate at least one modified data value. The at least one modified data value is stored in at least one vector element of the plurality of vector elements. An alignment of the at least one modified data value is determined relative to a boundary of a vector element of the vector register. An alignment operation to re-align the at least one modified data value based on the boundary of the vector element of the vector register is performed. | 04-15-2010 |
20100095087 | Dynamic Data Driven Alignment and Data Formatting in a Floating-Point SIMD Architecture - Mechanisms are provided for dynamic data driven alignment and data formatting in a floating point SIMD architecture. At least two operand inputs are input to a permute unit of a processor. Each operand input contains at least one floating point value upon which a permute operation is to be performed by the permute unit. A control vector input, having a plurality of floating point values that together constitute the control vector input, is input to the permute unit of the processor for controlling the permute operation of the permute unit. The permute unit performs a permute operation on the at least two operand inputs according to a permutation pattern specified by the plurality of floating point values that constitute the control vector input. Moreover, a result output of the permute operation is output from the permute unit to a result vector register of the processor. | 04-15-2010 |
20100218155 | Automated Critical Area Allocation in a Physical Synthesized Hierarchical Design - A method, computer program product, and data processing system for efficiently performing automated placement of timing-critical unit-level cells in a hierarchical integrated circuit design is disclosed. In preparation for global optimization of the entire unit at the cell level, macro-level cells are assigned a “placement force” that serves to limit the movement of the macro-level cells from their current position. Movement boundaries for each macro element are also defined, so as to keep the components in a given macro element in relative proximity to each other. | 08-26-2010 |
20110221473 | SOFT ERROR DETECTION FOR LATCHES - A system and method for soft error detection in digital ICs is disclosed. The system includes an observing circuit coupled to a latch, which circuit is capable of a response upon a state change of the latch. The system further includes synchronized clocking provided to the latch and to the observing circuit. For the latch, the clocking defines a window in time during which the latch is prevented from receiving data, and in a synchronized manner the clocking enables a response in the observing circuit. The clocking is synchronized in such a manner that the circuit is enabled for its response only inside the window when the latch is prevented from receiving data. The system may also have additional circuits that are respectively coupled to latches, with each additional circuit and its respective latch receiving the synchronized clocking. Responses of a plurality of circuits may be coupled in a configuration corresponding to a logical OR. | 09-15-2011 |
20130246737 | SIMD Compare Instruction Using Permute Logic for Distributed Register Files - Mechanisms, in a data processing system comprising a single instruction multiple data (SIMD) processor, for performing a data dependency check operation on vector element values of at least two input vector registers are provided. Two calls to a simd-check instruction are performed, one with input vector registers having a first order and one with the input vector registers having a different order. The simd-check instruction performs comparisons to determine if any data dependencies are present. Results of the two calls to the simd-check instruction are obtained and used to determine if any data dependencies are present in the at least two input vector registers. Based on the results, the SIMD processor may perform various operations. | 09-19-2013 |
20140040592 | ACTIVE BUFFERED MEMORY - According to one embodiment of the present invention, a method for operating a memory device that includes memory and a processing element includes receiving, in the processing element, a command from a requestor, loading, in the processing element, a program based on the command, the program comprising a load instruction loaded from a first memory location in the memory, and performing, by the processing element, the program, the performing including loading data in the processing element from a second memory location in the memory. The method also includes generating, by the processing element, a virtual address of the second memory location based on the load instruction and translating, by the processing element, the virtual address into a real address. | 02-06-2014 |
20140040596 | PACKED LOAD/STORE WITH GATHER/SCATTER - Embodiments relate to packed loading and storing of data. An aspect includes a method for packed loading and storing of data distributed in a system that includes memory and a processing element. The method includes fetching and decoding an instruction for execution by the processing element. The processing element gathers a plurality of individually addressable data elements from non-contiguous locations in the memory which are narrower than a nominal width of register file elements in the processing element based on the instruction. The data elements are packed and loaded into register file elements of a register file entry by the processing element based on the instruction, such that at least two of the data elements gathered from the non-contiguous locations in the memory are packed and loaded into a single register file element of the register file entry. | 02-06-2014 |
20140040597 | PREDICATION IN A VECTOR PROCESSOR - Embodiments relate to vector processor predication in an active memory device. An aspect includes a system for vector processor predication in an active memory device. The system includes memory in the active memory device and a processing element in the active memory device. The processing element is configured to perform a method including decoding an instruction with a plurality of sub-instructions to execute in parallel. One or more mask bits are accessed from a vector mask register in the processing element. The one or more mask bits are applied by the processing element to predicate operation of a unit in the processing element associated with at least one of the sub-instructions. | 02-06-2014 |
20140040598 | VECTOR PROCESSING IN AN ACTIVE MEMORY DEVICE - Embodiments relate to vector processing in an active memory device. An aspect includes a system for vector processing in an active memory device. The system includes memory in the active memory device and a processing element in the active memory device. The processing element is configured to perform a method including decoding an instruction with a plurality of sub-instructions to execute in parallel. An iteration count to repeat execution of the sub-instructions in parallel is determined. Execution of the sub-instructions is repeated in parallel for multiple iterations, by the processing element, based on the iteration count. Multiple locations in the memory are accessed in parallel based on the execution of the sub-instructions. | 02-06-2014 |
20140040599 | PACKED LOAD/STORE WITH GATHER/SCATTER - Embodiments relate to packed loading and storing of data. An aspect includes a system for packed loading and storing of distributed data. The system includes memory and a processing element configured to communicate with the memory. The processing element is configured to perform a method including fetching and decoding an instruction for execution by the processing element. A plurality of individually addressable data elements is gathered from non-contiguous locations in the memory which are narrower than a nominal width of register file elements in the processing element based on the instruction. The processing element packs and loads the data elements into register file elements of a register file entry based on the instruction, such that at least two of the data elements gathered from the non-contiguous locations in the memory are packed and loaded into a single register file element of the register file entry. | 02-06-2014 |
20140040601 | PREDICATION IN A VECTOR PROCESSOR - Embodiments relate to vector processor predication in an active memory device. An aspect includes a method for vector processor predication in an active memory device that includes memory and a processing element. The method includes decoding, in the processing element, an instruction including a plurality of sub-instructions to execute in parallel. One or more mask bits are accessed from a vector mask register in the processing element. The one or more mask bits are applied by the processing element to predicate operation of a unit in the processing element associated with at least one of the sub-instructions. | 02-06-2014 |
20140040603 | VECTOR PROCESSING IN AN ACTIVE MEMORY DEVICE - Embodiments relate to vector processing in an active memory device. An aspect includes a method for vector processing in an active memory device that includes memory and a processing element. The method includes decoding, in the processing element, an instruction including a plurality of sub-instructions to execute in parallel. An iteration count to repeat execution of the sub-instructions in parallel is determined. Based on the iteration count, execution of the sub-instructions in parallel is repeated for multiple iterations by the processing element. Multiple locations in the memory are accessed in parallel based on the execution of the sub-instructions. | 02-06-2014 |
20140047211 | VECTOR REGISTER FILE - An aspect includes accessing a vector register in a vector register file. The vector register file includes a plurality of vector registers and each vector register includes a plurality of elements. A read command is received at a read port of the vector register file. The read command specifies a vector register address. The vector register address is decoded by an address decoder to determine a selected vector register of the vector register file. An element address is determined for one of the plurality of elements associated with the selected vector register based on a read element counter of the selected vector register. A word is selected in a memory array of the selected vector register as read data based on the element address. The read data is output from the selected vector register based on the decoding of the vector register address by the address decoder. | 02-13-2014 |
20140047214 | VECTOR REGISTER FILE - An aspect includes accessing a vector register in a vector register file. The vector register file includes a plurality of vector registers and each vector register includes a plurality of elements. A read command is received at a read port of the vector register file. The read command specifies a vector register address. The vector register address is decoded by an address decoder to determine a selected vector register of the vector register file. An element address is determined for one of the plurality of elements associated with the selected vector register based on a read element counter of the selected vector register. A word is selected in a memory array of the selected vector register as read data based on the element address. The read data is output from the selected vector register based on the decoding of the vector register address by the address decoder. | 02-13-2014 |
20140115294 | MEMORY PAGE MANAGEMENT - According to one embodiment, a method for operating a memory device includes receiving a first request from a requestor, wherein the first request includes accessing data at a first memory location in a memory bank, opening a first page in the memory bank, wherein opening the first page includes loading a row including the first memory location into a buffer, the row being loaded from a row location in the memory bank, and transmitting the data from the first memory location to the requestor. The method also includes determining, by a memory controller, whether to close the first page following execution of the first request based on information relating to a likelihood that a subsequent request will access the first page. | 04-24-2014 |
20140129799 | ADDRESS GENERATION IN AN ACTIVE MEMORY DEVICE - Embodiments relate to address generation in an active memory device that includes memory and a processing element. An aspect includes a method for address generation in the active memory device. The method includes reading a base address value and an offset address value from a register file group of the processing element. The processing element determines a virtual address based on the base address value and the offset address value. The processing element translates the virtual address into a physical address and accesses a location in the memory based on the physical address. | 05-08-2014 |
20140130050 | MAIN PROCESSOR SUPPORT OF TASKS PERFORMED IN MEMORY - According to one embodiment of the present invention, a method for operating a computer system including a main processor, a processing element and memory is provided. The method includes receiving, at the processing element, a task from the main processor, performing, by the processing element, an instruction specified by the task, determining, by the processing element, that a function is to be executed on the main processor, the function being part of the task, sending, by the processing element, a request to the main processor for execution, the request comprising execution of the function and receiving, at the processing element, an indication that the main processor has completed execution of the function specified by the request. | 05-08-2014 |
20140130051 | MAIN PROCESSOR SUPPORT OF TASKS PERFORMED IN MEMORY - According to one embodiment of the present invention, a computer system for executing a task includes a main processor, a processing element and memory. The computer system is configured to perform a method including receiving, at the processing element, the task from the main processor, performing, by the processing element, an instruction specified by the task, determining, by the processing element, that a function is to be executed on the main processor, the function being part of the task, sending, by the processing element, a request to the main processor for execution, the request including execution of the function and receiving, at the processing element, an indication that the main processor has completed execution of the function specified by the request. | 05-08-2014 |
20140136811 | ACTIVE MEMORY DEVICE GATHER, SCATTER, AND FILTER - Embodiments relate to loading and storing of data. An aspect includes a method for transferring data in an active memory device that includes memory and a processing element. An instruction is fetched and decoded for execution by the processing element. Based on determining that the instruction is a gather instruction, the processing element determines a plurality of source addresses in the memory from which to gather data elements and a destination address in the memory. One or more gathered data elements are transferred from the source addresses to contiguous locations in the memory starting at the destination address. Based on determining that the instruction is a scatter instruction, a source address in the memory from which to read data elements at contiguous locations and one or more destination addresses in the memory to store the data elements at non-contiguous locations are determined, and the data elements are transferred. | 05-15-2014 |
20140136894 | EXPOSED-PIPELINE PROCESSING ELEMENT WITH ROLLBACK - An aspect includes providing rollback support in an exposed-pipeline processing element. A method for providing rollback support in an exposed-pipeline processing element includes detecting, by rollback support logic, an error associated with execution of an instruction in the exposed-pipeline processing element. The rollback support logic determines whether the exposed-pipeline processing element supports replay of the instruction for a predetermined number of cycles. Based on determining that the exposed-pipeline processing element supports replay of the instruction, a rollback action is performed in the exposed-pipeline processing element to attempt recovery from the error. | 05-15-2014 |
20140136895 | EXPOSED-PIPELINE PROCESSING ELEMENT WITH ROLLBACK - An aspect includes providing rollback support in an exposed-pipeline processing element. A system includes the exposed-pipeline processing element with rollback support logic. The rollback support logic is configured to detect an error associated with execution of an instruction in the exposed-pipeline processing element. The rollback support logic determines whether the exposed-pipeline processing element supports replay of the instruction for a predetermined number of cycles. Based on determining that the exposed-pipeline processing element supports replay of the instruction, a rollback action is performed in the exposed-pipeline processing element to attempt recovery from the error. | 05-15-2014 |
20140149673 | LOW LATENCY DATA EXCHANGE - According to one embodiment, a method for exchanging data in a system that includes a main processor in communication with an active memory device is provided. The method includes a processing element in the active memory device receiving an instruction from the main processor and receiving a store request from a thread running on the main processor, the store request specifying a memory address associated with the processing element. The method also includes storing a value provided in the store request in a queue in the processing element and the processing element performing the instruction using the value from the queue. | 05-29-2014 |
20140149680 | LOW LATENCY DATA EXCHANGE - According to one embodiment, a method for exchanging data in a system that includes a main processor in communication with an active memory device is provided. The method includes a processing element in the active memory device receiving an instruction from the main processor and receiving a store request from a thread running on the main processor, the store request specifying a memory address associated with the processing element. The method also includes storing a value provided in the store request in a queue in the processing element and the processing element performing the instruction using the value from the queue. | 05-29-2014 |
20140173224 | SEQUENTIAL LOCATION ACCESSES IN AN ACTIVE MEMORY DEVICE - Embodiments relate to sequential location accesses in an active memory device that includes memory and a processing element. An aspect includes a method for sequential location accesses that includes receiving from the memory a first group of data values associated with a queue entry at the processing element. A tag value associated with the queue entry and specifying a position from which to extract a first subset of the data values is read. The queue entry is populated with the first subset of the data values starting at the position specified by the tag value. The processing element determines whether a second subset of the data values in the first group of data values is associated with a subsequent queue entry, and populates a portion of the subsequent queue entry with the second subset of the data values. | 06-19-2014 |
20140195743 | ON-CHIP TRAFFIC PRIORITIZATION IN MEMORY - According to one embodiment, a method for traffic prioritization in a memory device includes sending a memory access request including a priority value from a processing element in the memory device to a crossbar interconnect in the memory device. The memory access request is routed through the crossbar interconnect to a memory controller in the memory device associated with the memory access request. The memory access request is received at the memory controller. The priority value of the memory access request is compared to priority values of a plurality of memory access requests stored in a queue of the memory controller to determine a highest priority memory access request. A next memory access request is performed by the memory controller based on the highest priority memory access request. | 07-10-2014 |
20140195744 | ON-CHIP TRAFFIC PRIORITIZATION IN MEMORY - According to one embodiment, a memory device is provided. The memory device includes a processing element coupled to a crossbar interconnect. The processing element is configured to send a memory access request, including a priority value, to the crossbar interconnect. The crossbar interconnect is configured to route the memory access request to a memory controller associated with the memory access request. The memory controller is coupled to memory and to the crossbar interconnect. The memory controller includes a queue and is configured to compare the priority value of the memory access request to priority values of a plurality of memory access requests stored in the queue of the memory controller to determine a highest priority memory access request and perform a next memory access request based on the highest priority memory access request. | 07-10-2014 |
20140281084 | LOCAL BYPASS FOR IN MEMORY COMPUTING - Embodiments include a method for bypassing data in an active memory device. The method includes a requestor determining a number of transfers to a grantor that have not been communicated to the grantor, requesting to the interconnect network that the bypass path be used for the transfers based on the number of transfers meeting a threshold and communicating the transfers via the bypass path to the grantor based on the request, the interconnect network granting control of the grantor in response to the request. The method also includes the interconnect network requesting control of the grantor based on an event and communicating delayed transfers via the interconnect network from other requestors, the delayed transfers being delayed due to the grantor being previously controlled by the requestor, the communicating based on the control of the grantor being changed back to the interconnect network. | 09-18-2014 |
20140281100 | LOCAL BYPASS FOR IN MEMORY COMPUTING - Embodiments include a method for bypassing data in an active memory device. The method includes a requestor determining a number of transfers to a grantor that have not been communicated to the grantor, requesting to the interconnect network that the bypass path be used for the transfers based on the number of transfers meeting a threshold and communicating the transfers via the bypass path to the grantor based on the request, the interconnect network granting control of the grantor in response to the request. The method also includes the interconnect network requesting control of the grantor based on an event and communicating delayed transfers via the interconnect network from other requestors, the delayed transfers being delayed due to the grantor being previously controlled by the requestor, the communicating based on the control of the grantor being changed back to the interconnect network. | 09-18-2014 |
20140281386 | CHAINING BETWEEN EXPOSED VECTOR PIPELINES - Embodiments include a method for chaining data in an exposed-pipeline processing element. The method includes separating a multiple instruction word into a first sub-instruction and a second sub-instruction, receiving the first sub-instruction and the second sub-instruction in the exposed-pipeline processing element. The method also includes issuing the first sub-instruction at a first time, issuing the second sub-instruction at a second time different than the first time, the second time being offset to account for a dependency of the second sub-instruction on a first result from the first sub-instruction, the first pipeline performing the first sub-instruction at a first clock cycle and communicating the first result from performing the first sub-instruction to a chaining bus coupled to the first pipeline and a second pipeline, the communicating at a second clock cycle subsequent to the first clock cycle that corresponds to a total number of latch pipeline stages in the first pipeline. | 09-18-2014 |
20140281403 | CHAINING BETWEEN EXPOSED VECTOR PIPELINES - Embodiments include a method for chaining data in an exposed-pipeline processing element. The method includes separating a multiple instruction word into a first sub-instruction and a second sub-instruction, receiving the first sub-instruction and the second sub-instruction in the exposed-pipeline processing element. The method also includes issuing the first sub-instruction at a first time, issuing the second sub-instruction at a second time different than the first time, the second time being offset to account for a dependency of the second sub-instruction on a first result from the first sub-instruction, the first pipeline performing the first sub-instruction at a first clock cycle and communicating the first result from performing the first sub-instruction to a chaining bus coupled to the first pipeline and a second pipeline, the communicating at a second clock cycle subsequent to the first clock cycle that corresponds to a total number of latch pipeline stages in the first pipeline. | 09-18-2014 |
20140281605 | POWER MANAGEMENT FOR A COMPUTER SYSTEM - Embodiments include a method for managing power in a computer system including a main processor and an active memory device including powered units, the active memory device in communication with the main processor by a memory link, the powered units including a processing element. The method includes the main processor executing a program on a program thread, encountering a first section of code to be executed by the active memory device, changing, by a first command, a power state of a powered unit on the active memory device based on the main processor encountering the first section of code, the first command including a store command. The method also includes the processing element executing the first section of code at a second time, changing a power state of the main processor from a power use state to a power saving state based on the processing element executing the first section. | 09-18-2014 |
20140281629 | POWER MANAGEMENT FOR A COMPUTER SYSTEM - Embodiments include a method for managing power in a computer system including a main processor and an active memory device including powered units, the active memory device in communication with the main processor by a memory link, the powered units including a processing element. The method includes the main processor executing a program on a program thread, encountering a first section of code to be executed by the active memory device, changing, by a first command, a power state of a powered unit on the active memory device based on the main processor encountering the first section of code, the first command including a store command. The method also includes the processing element executing the first section of code at a second time, changing a power state of the main processor from a power use state to a power saving state based on the processing element executing the first section. | 09-18-2014 |
20150084673 | MARGIN IMPROVEMENT FOR CONFIGURABLE LOCAL CLOCK BUFFER - A timing margin circuit of a local clock buffer circuit may include an inverter logic gate having an inverter input and an inverter output, whereby the inverter input receives an input clock signal. A NAND logic gate includes a first NAND input coupled to the inverter output, a second NAND input, and a NAND output. The circuit also includes a logic device having a first logic device input that is coupled to the inverter output, a second logic device input that receives a mode selection signal, and a logic device output that couples to the second NAND input, whereby the NAND logic gate generates a first time delayed input clock signal and a second time delayed input clock signal, such that the first and the second time delayed input clock signal control a falling edge transition of a local clock signal derived from the input clock signal. | 03-26-2015 |
20150177811 | POWER MANAGEMENT FOR IN-MEMORY COMPUTER SYSTEMS - According to one embodiment, a method for power management of a compute node including at least two power-consuming components is provided. A power capping control system compares power consumption level of the compute node to a power cap. Based on determining that the power consumption level is greater than the power cap, actions are performed including: reducing power provided to a first power-consuming component based on determining that it has an activity level below a first threshold and that power can be reduced to the first power-consuming component. Power provided to a second power-consuming component is reduced based on determining that it has an activity level below a second threshold and that power can be reduced to the second power-consuming component. Power reduction is forced in the compute node based on determining that power cannot be reduced in either of the first or second power-consuming component. | 06-25-2015 |
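
The power-capping decision described in the last entry (20150177811) can be sketched as a single control step. This is only an illustrative reading of the abstract, not the claimed implementation: the `Component` type, its field names, the `cap_actions` function, and the action strings are all hypothetical, and the sketch assumes that "power cannot be reduced in either component" means neither component has a lower power state available.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    """Hypothetical model of one power-consuming component."""
    name: str
    activity: float    # assumed activity level, e.g. utilization in [0, 1]
    threshold: float   # activity threshold below which power may be trimmed
    can_reduce: bool   # assumed flag: a lower power state is available


def cap_actions(node_power: float, power_cap: float,
                first: Component, second: Component) -> List[str]:
    """Return the actions chosen for one control interval."""
    # Nothing to do while the node is at or under its cap.
    if node_power <= power_cap:
        return []

    actions = []
    # Trim each component that is both lightly used and able to drop power.
    for comp in (first, second):
        if comp.can_reduce and comp.activity < comp.threshold:
            actions.append("reduce:" + comp.name)

    # Assumption: force a node-level reduction when neither component
    # can have its power reduced.
    if not first.can_reduce and not second.can_reduce:
        actions.append("force:node")
    return actions


if __name__ == "__main__":
    processor = Component("processor", activity=0.2, threshold=0.5, can_reduce=True)
    memory = Component("memory", activity=0.9, threshold=0.5, can_reduce=True)
    print(cap_actions(120.0, 100.0, processor, memory))
```

In this sketch only the lightly used processor is trimmed when the node exceeds its cap; the busy memory component is left alone, and a forced node-level reduction appears only when both components report that no reduction is possible.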