Note
Storm
Non transactional (no ack)
- at most once
- possible data loss
Non transactional (with ack)
- at least once
- require explicit retry logic
transactional
- exactly once
- works well for batches
- use transactionalTopolicyBuilder
- implement a committer bolt
log system
log记录数据的状态,类似于版本控制系统。 mechanics of data flow: log semantics of details: metadata, schemas, compatibility, all the details of handling data structure and evolution. pipeline提供统一的数据源,方便加入新的数据处理机制。pipeline负责保证数据的结构正确。
Spark
direct kafka keeps 1:1 mapping of each kafka partition to RDD partition in streaming processing. So its better to have more partitions than cores. contains more log lines in a message. Tune batch interval. tuen maxrateperpartition to avoid skewed data taking up too much time. Uniform data distribution. lambda architecture. Cassandra is good at write.
Flink
exactly-once two-phrase commit 实现方法,实现beginTransaction, pre-commit, commit, abort two-phrase代表precommit和commit
Git
减少冲突应该限制文件大小,养成经常commit,pull的习惯。一个TICKET开一个branch。提交merge request之前先解决冲突。
Python
__all__
import * 的module
Hive
bloom filter: false positive probability只有假阳性,没有假阴性,一个set里使用多个hash函数。 orc 三种索引。文件级别的索引,给出整个文件的列统计信息。stripe level给出每个stripe的列统计信息。row level的索引没一定数量的行数生成行索引。