Storm

Non-transactional (no ack)
  • at most once
  • possible data loss

Non-transactional (with ack)

  • at least once
  • requires explicit retry logic
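
A minimal sketch of why acked delivery is at-least-once rather than exactly-once, assuming the ack for a processed message can be lost. This is a plain-Python simulation, not Storm code; FlakySink and send_at_least_once are hypothetical names used only for illustration.

```python
class FlakySink:
    """Hypothetical sink whose ack is lost on the first attempt per message."""

    def __init__(self):
        self.seen = []           # every delivery, duplicates included
        self._ack_lost = set()   # messages whose first ack was dropped

    def process(self, msg):
        self.seen.append(msg)    # the side effect happens before the ack
        if msg not in self._ack_lost:
            self._ack_lost.add(msg)
            return False         # simulate a lost ack or a timeout
        return True              # acked


def send_at_least_once(messages, process, max_retries=3):
    """Redeliver each message until it is acked or retries run out."""
    for msg in messages:
        for _ in range(max_retries):
            if process(msg):
                break
        else:
            raise RuntimeError(f"gave up on {msg!r}")


sink = FlakySink()
send_at_least_once(["a", "b"], sink.process)
print(sink.seen)  # ['a', 'a', 'b', 'b']
```

Nothing is lost, but every message is applied twice, so the downstream side has to tolerate duplicates.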

Transactional

  • exactly once
  • works well for batches
  • use TransactionalTopologyBuilder
  • implement a committer bolt
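
The idea behind a committer can be sketched as follows, assuming each batch carries a transaction id (txid) and commits arrive in strict txid order. The sketch is plain Python rather than the Storm API; CountStore is a hypothetical stand-in for the external state a committer bolt would update.

```python
class CountStore:
    """Hypothetical state store that keeps the last applied txid next to the value."""

    def __init__(self):
        self.count = 0
        self.last_txid = None

    def commit(self, txid, batch_count):
        # A replayed batch carries the same txid as the one already applied,
        # so it is skipped instead of being counted twice.
        if txid == self.last_txid:
            return False
        self.count += batch_count
        self.last_txid = txid
        return True


store = CountStore()
store.commit(1, 3)   # first batch applied
store.commit(1, 3)   # replay of batch 1, ignored
store.commit(2, 2)   # next batch applied
print(store.count)   # 5
```

Storing the txid together with the state is what turns at-least-once replay into an exactly-once update. The comparison against only the last txid is valid only under the ordered-commit assumption stated above.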

Log system

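A log records the state of data over time, similar to a version control system. Mechanics of data flow: the log. Semantics and details: metadata, schemas, compatibility, and everything involved in handling data structure and evolution. The pipeline provides a single unified data source, which makes it easy to plug in new data processing systems. The pipeline is also responsible for ensuring that the data is well structured.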
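
A minimal sketch of the version-control analogy, assuming the log is an append-only list of (key, value) records addressed by offset. Log and rebuild_state are hypothetical names for illustration, not part of any real log system.

```python
class Log:
    """Append-only sequence of records addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def read(self, start=0, end=None):
        return self._records[start:end]


def rebuild_state(records):
    """Replay (key, value) records in order; the latest value per key wins."""
    state = {}
    for key, value in records:
        state[key] = value
    return state


log = Log()
log.append(("user:1", "alice"))
log.append(("user:2", "bob"))
log.append(("user:1", "alice2"))

print(rebuild_state(log.read()))       # {'user:1': 'alice2', 'user:2': 'bob'}
print(rebuild_state(log.read(end=2)))  # {'user:1': 'alice', 'user:2': 'bob'}
```

Replaying a prefix of the log reproduces an earlier state, much like checking out an older revision. A new consumer can be added at any time by reading from offset 0.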

Spark

The direct Kafka approach keeps a 1:1 mapping between each Kafka partition and an RDD partition in stream processing, so it's better to have more partitions than cores. Pack more log lines into each message. Tune the batch interval. Tune spark.streaming.kafka.maxRatePerPartition so that skewed data does not take up too much time in a batch. Keep the data uniformly distributed across partitions. Lambda architecture. Cassandra is good at writes.
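
A rough back-of-the-envelope sketch of how these knobs interact, assuming one task per Kafka partition and a per-partition rate limit expressed in records per second. The setting name is the real Spark one; the helper functions and the example numbers are hypothetical.

```python
import math

# Real Spark setting name; the value is an arbitrary example.
conf = {"spark.streaming.kafka.maxRatePerPartition": "1000"}


def max_records_per_batch(rate_per_partition, kafka_partitions, batch_interval_s):
    """Upper bound on the records one micro-batch can pull from Kafka."""
    return rate_per_partition * kafka_partitions * batch_interval_s


def task_waves(kafka_partitions, cores):
    """With one task per partition, how many rounds of tasks a batch needs."""
    return math.ceil(kafka_partitions / cores)


rate = int(conf["spark.streaming.kafka.maxRatePerPartition"])
print(max_records_per_batch(rate, kafka_partitions=8, batch_interval_s=5))  # 40000
print(task_waves(kafka_partitions=8, cores=4))                              # 2
```

The cap applies per partition, so a single hot partition cannot stretch a batch beyond rate × interval records for that partition.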

Exactly-once via two-phase commit: implement beginTransaction, pre-commit, commit, and abort. The two phases are pre-commit and commit.
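
A minimal in-memory sketch of the four operations, assuming a coordinator that commits only when every participant succeeds at pre-commit. StagingSink and run_transaction are hypothetical names; a real sink would stage data durably (for example in a temporary file or an open database transaction) rather than in a Python list.

```python
class StagingSink:
    """Hypothetical sink that stages writes until the commit phase."""

    def __init__(self, fail_pre_commit=False):
        self.committed = []      # records visible to readers
        self._open = None        # writes of the current transaction
        self._staged = None      # writes that passed pre-commit
        self._fail_pre_commit = fail_pre_commit

    def begin_transaction(self):
        self._open = []

    def write(self, record):
        self._open.append(record)

    def pre_commit(self):
        # Phase 1: stop accepting writes and vote on whether commit can succeed.
        if self._fail_pre_commit:
            return False
        self._staged, self._open = self._open, None
        return True

    def commit(self):
        # Phase 2: make the staged records visible.
        self.committed.extend(self._staged)
        self._staged = None

    def abort(self):
        self._open = None
        self._staged = None


def run_transaction(sinks, records):
    for sink in sinks:
        sink.begin_transaction()
        for record in records:
            sink.write(record)
    if all(sink.pre_commit() for sink in sinks):
        for sink in sinks:
            sink.commit()
        return True
    for sink in sinks:
        sink.abort()
    return False


a, b = StagingSink(), StagingSink()
print(run_transaction([a, b], [1, 2]), a.committed)   # True [1, 2]

c = StagingSink(fail_pre_commit=True)
print(run_transaction([a, c], [3]), a.committed)      # False [1, 2]
```

In the second run, sink a had already passed pre-commit, but the failed vote from c causes both to abort, so record 3 never becomes visible anywhere.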

Git

To reduce conflicts, limit file size and make a habit of committing and pulling often. Open one branch per ticket. Resolve conflicts before submitting a merge request.

Python

__all__: lists the names a module exports when it is imported with import *.
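
A small self-contained demonstration. To keep it in one file, it builds a throwaway module in memory instead of using a separate .py file; the module name shapes_demo and its functions are hypothetical.

```python
import sys
import types

# Source of the throwaway module. Only "area" is listed in __all__.
source = '''
__all__ = ["area"]

def area(w, h):
    return w * h

def helper():
    return "not exported"
'''

module = types.ModuleType("shapes_demo")
exec(source, module.__dict__)
sys.modules["shapes_demo"] = module   # make it importable by name

# Run the star import in a fresh namespace and look at what arrived.
namespace = {}
exec("from shapes_demo import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))

print(exported)          # ['area']
print(module.helper())   # not exported
```

helper is still reachable through the module object or an explicit from shapes_demo import helper; __all__ only affects the star import.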

Hive

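Bloom filter: has a false positive probability. Only false positives are possible, never false negatives. A single set uses multiple hash functions. ORC has three levels of indexes. The file-level index gives column statistics for the whole file. The stripe-level index gives column statistics for each stripe. The row-level index adds an index entry for every fixed number of rows.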
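
A minimal Bloom filter sketch showing why there are no false negatives: adding an item sets every bit its hashes map to, so a later lookup for that item always finds them set. The sizes are arbitrary, and seeded SHA-256 is used here only as a convenient stand-in for a family of hash functions; this is not the hashing ORC itself uses.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for key in ["alice", "bob", "carol"]:
    bf.add(key)

print(all(bf.might_contain(k) for k in ["alice", "bob", "carol"]))  # True
```

A lookup for a key that was never added usually returns False, but it can return True when other keys happen to have set the same bits. That is the false positive case, and its probability grows as the filter fills up.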