DataStage Parallelism

In the last few videos, I gave a brief overview of SMP and MPP systems. In this video, I will give an overview of how parallelism is achieved in DataStage jobs. DataStage parallel jobs are normally used to read data from different source systems, clean and transform it, apply business logic, and load the structured data into Data Warehouse tables. Because of parallelism, the overall job completes within a short span of time even when the number of input records is large.

DataStage jobs achieve parallelism using any of the following methods:
- Data Pipelining
- Data Partitioning
- Combining both Pipelining and Partitioning
- Dynamic Data Repartitioning

Data Pipelining:
In the Data Pipelining method, the source data is read and processed in segments. Assume the source stage has 10,000 records. The job is set up to pick up 1,000 records at a time and pass them on to the next stages, such as Transformer, Lookup, Filter, etc. The stages run on different nodes or processors, depending on the number of nodes defined in the DataStage job's configuration file (a small pipelining sketch and a sample configuration file appear after this section). On an SMP architecture, different CPUs in the same system can be used as different nodes; on an MPP architecture, different computers are used as nodes. So, all the intermediate stages between source and target run simultaneously on different processors. As soon as a stage receives a segment of data, it processes it and sends it to the next stage. By the time the next segment of data is being read from the source, the segment already read is being processed and passed along, and so on. In this way the data segments flow continuously from the source to the target Data Warehouse table, as shown in the figure.

What happens if Data Pipelining is not employed? If all the stages run on the same node (processor), only one stage can be active at any point in time. First, all the records from the source must be read and stored in temporary memory. Then this data is processed by the next stage, and so on, until it is written to the target table. This increases the total time the job takes to complete, and it also increases the amount of memory needed when the input data is very large.

Data Partitioning:
In the Data Partitioning method, the input records are divided into partitions based on the partitioning method defined in the job stages. The total number of partitions depends on the number of processing nodes defined in the configuration file. The DataStage job creates instances of the stages on all the defined processing nodes, and all these instances run simultaneously, each processing the partition of data it receives. DataStage partitions the data using any of the following methods:
- Round robin partitioner
- Random partitioner
- Same partitioner
- Entire partitioner
- Hash partitioner
- Modulus partitioner
- Range partitioner
- DB2 partitioner
- Auto partitioner

Combining data partitioning and pipelining:
We can combine data pipelining and partitioning and achieve even greater performance.

Dynamic Data Repartitioning:
In the Dynamic Data Repartitioning method, the data flowing between intermediate stages in the DataStage job can be repartitioned as required. For instance, you may have initially partitioned the data on the customer last name, but now want to process the data grouped by city. You will need to repartition to ensure that all customers belonging to the same city end up in the same partition (a small repartitioning sketch also appears after this section).
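To make the segment-by-segment flow concrete, here is a small Python sketch. It is an illustration only, not DataStage code: the stage names, the 1,000-record segment size, and the toy transformation and filter rules are assumptions chosen for the example.

```python
# Illustration only: a generator-based pipeline in which each stage starts
# working on a segment as soon as the previous stage hands it over, instead
# of waiting for the whole source to be read into memory first.

def source_stage(total_records=10_000, segment_size=1_000):
    """Read the source in segments of 1,000 records at a time."""
    for start in range(0, total_records, segment_size):
        segment = [{"id": i} for i in range(start, start + segment_size)]
        yield segment  # hand the segment to the next stage immediately

def transformer_stage(segments):
    """Apply some toy business logic to every record in each segment."""
    for segment in segments:
        yield [{**rec, "amount": rec["id"] * 2} for rec in segment]

def filter_stage(segments):
    """Keep only records that satisfy a (made-up) condition."""
    for segment in segments:
        yield [rec for rec in segment if rec["amount"] % 4 == 0]

def target_stage(segments):
    """Stand-in for loading each segment into the Data Warehouse table."""
    loaded = 0
    for segment in segments:
        loaded += len(segment)  # a real job would write the segment out here
    return loaded

# Chaining the stages lets segments flow through the whole pipeline while
# later segments are still waiting to be read from the source.
print(target_stage(filter_stage(transformer_stage(source_stage()))))
```

The point of the sketch is that the target starts receiving segments before the source has finished producing them, so the full data set never has to sit in memory at once; in a real DataStage job the stages additionally run on different processors at the same time.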
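The number of nodes referred to above comes from the job's parallel configuration file. Purely as an illustration, a two-node configuration file might look roughly like the sketch below; the node names, host names (fastname) and disk paths are placeholders, not values taken from the video.

```
{
  node "node1"
  {
    fastname "etl_server_1"
    pools ""
    resource disk "/data/datastage/disk1" {pools ""}
    resource scratchdisk "/data/datastage/scratch1" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server_2"
    pools ""
    resource disk "/data/datastage/disk2" {pools ""}
    resource scratchdisk "/data/datastage/scratch2" {pools ""}
  }
}
```

On an SMP system the two nodes would typically share the same fastname (the same machine), while on an MPP system each node would point to a different machine.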
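To show why the last-name-to-city example needs a repartition, here is a small Python sketch of the usual hash-partitioning idea, where a record goes to partition hash(key) mod number-of-nodes. It is not DataStage's internal implementation; the two-partition count and the sample customers are made up.

```python
# Illustration only: hash partitioning on one key, then repartitioning on another.
NUM_PARTITIONS = 2  # assume a two-node configuration for the example

def hash_partition(records, key):
    """Send each record to the partition given by hash(key value) mod partition count."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for rec in records:
        partitions[hash(rec[key]) % NUM_PARTITIONS].append(rec)
    return partitions

customers = [
    {"last_name": "Smith", "city": "Chennai"},
    {"last_name": "Jones", "city": "Chennai"},
    {"last_name": "Kumar", "city": "Mumbai"},
    {"last_name": "Patel", "city": "Mumbai"},
]

# First pass: partitioned on last name, so two customers from the same city
# may well end up on different nodes.
by_last_name = hash_partition(customers, "last_name")

# Repartition on city: gather the records back and hash on the city key,
# which guarantees that all customers of one city land in the same partition.
by_city = hash_partition([rec for part in by_last_name for rec in part], "city")

for i, part in enumerate(by_city):
    print(f"partition {i}: {sorted(rec['city'] for rec in part)}")
```

Because the second pass hashes on the city key, every record with the same city value is guaranteed to land in the same partition, which is exactly what grouping by city requires.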
Beware: Repartitioning data between stages can affect the performance of the job and also the balance of the partitions (it can produce unequally sized partitions).

Thanks for watching the video. If you have any questions or feedback, please post them in the comments section. Have a nice day!
