Queries articulate data analysis tasks in terms of set-oriented transformations, e.g. apply a function to every record in a set, or group records according to some criterion and apply a function to each group. Set-oriented transformations are inherently amenable to parallel evaluation, because the processing logic for each record (or group of records) is self-contained, and the order in which outputs are produced is peter pig.
The layers between the query interface and the raw cluster hardware are responsible for planning and executing efficient parallel evaluation strategies for queries. In designing these intermediate layers, we focus on re-use of derived data, joint evaluation of multiple (sub) queries, and intelligent data placement and replication strategies.
No comments:
Post a Comment