Spark Basics Notes
#### Hadoop cluster setup
Set the namenode/datanode data dirs explicitly; the default lives under /tmp/… and is wiped on reboot.
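A minimal hdfs-site.xml sketch pinning both dirs outside /tmp (the paths are assumptions, not defaults):

```xml
<!-- hdfs-site.xml: keep HDFS metadata and blocks out of /tmp (paths are examples) -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hadoop/datanode</value>
  </property>
</configuration>
```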
#### Spark SQL vs Hive on Spark
Essentially the same thing. Spark SQL only uses the Hive metastore (for table metadata) and runs queries with Spark's own SQL processor for speed. Hive on Spark uses Spark as the execution engine but keeps Hive's own SQL processor.
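A minimal sketch of the Spark SQL side: pointing a SparkSession at an existing Hive metastore with `enableHiveSupport()` (the warehouse path below is an assumption):

```python
from pyspark.sql import SparkSession

# Spark SQL reading Hive-managed tables: only the metastore is shared;
# planning and execution stay inside Spark. (Warehouse path is hypothetical.)
spark = (SparkSession.builder
         .appName("sparksql-hive-metastore")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW TABLES").show()
```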
Spark is aimed at fast, near-real-time interactive analytics and needs a large amount of memory. It is a unified stack for streaming, interactive, and predictive analysis; the RDD is its distributed table abstraction. Core components:
- driver program (SparkContext)
- worker node
- SparkContext connects the application to the cluster; SparkSession is a unified entry point (SparkContext, Streaming, SQL). See the sketch below.
- Spark applications are isolated from one another.
- Deploy modes: client mode runs the driver outside the cluster; cluster mode runs it inside.
- Hierarchy: tasks make up stages, stages make up jobs. A stage is a collection of tasks running the same code on different partitions of the data; a stage contains no shuffle (shuffle boundaries separate stages).
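A minimal PySpark entry-point sketch (the app name is arbitrary; the spark-submit line in the comment is where deploy mode is chosen):

```python
from pyspark.sql import SparkSession

# Unified entry point: wraps SparkContext, SQL, and streaming.
# Submitted with e.g.: spark-submit --deploy-mode cluster app.py
spark = SparkSession.builder.appName("notes-demo").getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext
print(sc.master)         # cluster URL this application is connected to
```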
- no execution occurs until an action (transformations are lazy)
- an RDD is recomputed on every action; use persist() to retain it in memory (sketch below)
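A minimal sketch of both points, assuming a local PySpark session:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()

squares = sc.parallelize(range(1000)).map(lambda x: x * x)  # lazy: nothing runs yet
squares.persist(StorageLevel.MEMORY_ONLY)  # marks the RDD for caching; still lazy

print(squares.count())  # first action: triggers computation and caches the RDD
print(squares.sum())    # second action: reads cached partitions, no recomputation
```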
A DataFrame is an RDD with a tabular schema, plus query optimization (via the Catalyst optimizer).
flatMap applies a function that can return multiple elements per input element, flattening all the results into a new RDD.
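A small flatMap illustration (word splitting is the classic case):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

lines = sc.parallelize(["to be", "or not to be"])
words = lines.flatMap(lambda line: line.split())  # one input line -> many output words
print(words.collect())  # ['to', 'be', 'or', 'not', 'to', 'be']
```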
Expose a DataFrame to SQL via a temporary view (session-scoped), a global temporary view (shared across sessions), or by saving it as a table.
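A sketch of both registration styles (table and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

df.createOrReplaceTempView("people")          # visible in this session only
df.createOrReplaceGlobalTempView("people_g")  # shared across sessions

spark.sql("SELECT * FROM people WHERE id = 1").show()
spark.sql("SELECT * FROM global_temp.people_g").show()  # note the global_temp prefix
```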
Schema inference carries a performance overhead (an extra pass over the data); supplying an explicit schema avoids it.
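A sketch of supplying the schema up front (the file path and columns are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Explicit schema: the reader skips the inference pass over the file.
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])
df = spark.read.schema(schema).csv("people.csv", header=True)
```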
PySpark is built on top of Spark's Java API. Data is processed in Python and cached/shuffled in the JVM.
Spark's approach to the data-skew problem: if the values in the join column are very unevenly distributed, partitioning skews the data, some partitions get far more rows than others, and those straggler tasks slow down the whole job. The fix is salting: append a random number to the join key to spread hot keys over more partitions, then strip the salt after the join to restore the original data.
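A minimal salted-join sketch; the table names, data, and the bucket count N are all assumptions chosen for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()

N = 8  # number of salt buckets; tune to the observed skew (assumption)

# Hypothetical tables: 'big' is heavily skewed on join key 'k' (many 'a' rows).
big = spark.createDataFrame([("a", i) for i in range(100)] + [("b", 0)],
                            ["k", "v"])
small = spark.createDataFrame([("a", "x"), ("b", "y")], ["k", "w"])

# 1) Salt the skewed side: spread each hot key over N sub-keys.
big_salted = big.withColumn(
    "k_salt",
    F.concat_ws("_", F.col("k"), (F.rand() * N).cast("int").cast("string")))

# 2) Replicate the other side once per salt value so every sub-key finds a match.
salts = spark.range(N).withColumnRenamed("id", "salt")
small_salted = (small.crossJoin(salts)
                .withColumn("k_salt",
                            F.concat_ws("_", F.col("k"),
                                        F.col("salt").cast("string")))
                .drop("k", "salt"))

# 3) Join on the salted key, then drop the salt to restore the original shape.
joined = big_salted.join(small_salted, "k_salt").drop("k_salt")
joined.show()
```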
Reduce random access; prefer sequential access wherever possible.