Storage Transfer Service

同步在线数据,一次或多次,用于备份和同步。从本地同步使用gsutil。 Cloud Storage Transfer Service manages the transfer of data to a Cloud Storage bucket. The data source can be an AWS S3 bucket, a web-accessible URL, or another Cloud Storage bucket. Cloud Storage Transfer Service is intended for bulk transfer and is optimized for data volumes greater than 1TB.

TRANSFER APPLIANCE

线下传输,速度快,一个APPLIANCE容量1pb。

BigQuery Data Transfer Service

sources: Adwords, DoubleClick Campaign Manager, DoubleClick for Publishers and YouTube 只能导入到bigquery,不能导出

Cloud Firestore

和Cloud Datastore相比,支持mobile sdk。

Cloud Dataflow

alternative spark on dataproc Being able to analyze streaming data has transformed the way organizations do business and how they respond in real-time. Having to maintain different processing frameworks to deal with batch and streaming analytics, however, increases complexity by necessitating two different pipelines. And spending time optimizing cluster utilization and resources, as you do with Spark and Hadoop, distracts from the basic objective of filtering, aggregating, and transforming your data.

Cloud Dataflow was designed to simplify big data for both streaming and batch workloads. It does this by unifying the programming model and the execution model. Instead of having to specify a cluster size and manage capacity, Cloud Dataflow is a managed service where on-demand resources are created, auto-scaled, and parallelized. As a true zero-ops service, workers are added or removed based on the demands of the job. Cloud Dataflow also deals with the common problem of straggler workers found in distributed systems by constantly monitoring, identifying, and rescheduling work, including splits, to idle workers across the cluster. Consider the following use-cases:

MapReduce replacement—Process parallel workloads where non-MapReduce processing paradigms have led to operational complexity or frustration.

User analytics—Analyze high-volume user-behavior data, such as in-game events, click stream data, and retail sales data.

Data science—Process large amounts of data to make scientific discoveries and predictions, such as genomics, weather, and financial data.

ETL—Ingest, transform, and load data into a data warehouse, such as BigQuery.

Log processing—Process continuous event-log data processing to build real time dashboards, application metrics, and alerts.

The Dataflow SDK has also been released as the open source project Apache Beam which supports execution on Apache Spark and Apache Flink. Because of its auto-scaling and ease of deployment, Cloud Dataflow is an ideal location to run Dataflow/Apache Beam workflows.

Apache Avro

数据串行化系统

Kafka

Each record consists of a key, a value, and a timestamp.

  • Producer API生产者api发布stream到一个或多个topic
  • Consumer API消费者API允许订阅一个或多个topic
  • Stream API应用程序接受处理发送消息
  • Connector API将topic和现有系统相连接 queue便于扩展,但是只有一个subscriber。pub/sub可以多个subscriber但是不好扩展。一个分区对应一个consumer instance。consumer group代表处理线程组。 一个partition只有一个consumer保证了数据顺序正确,否则无法保证数据顺序的被处理。commit log storage。比pub/sub功能复杂。数据排序和数据分发。只能保证一个partition内是有序的。

Apache Storm

实时stream处理框架

pub/sub

企业级消息中间件。全球服务。

  • 负载平衡
  • 实现异步工作流
  • 发布通知
  • 刷新异步缓存
  • 多系统日志
  • 不同线程或者设备数据流
  • 改善服务可靠性 When using JSON over REST, message data must be base64-encoded. Asynchronous publishing allows for batching and higher throughput in your application. Batching to Balance Latency and Throughput.最长7天存活时间。In general, accommodating more-than-once delivery requires your subscriber to be idempotent when processing messages. Pull delivery主动调用pull方法,可以容许batching delivery所以throughput比较好。Push delivery Pub/sub使用request方法把消息Push给subscriber。 pull对endpoint没要求,push必须有non-selfsigned-certificate。synchronous适用于不需要立刻处理的消息,不需要长时间保持活动的连接,可以选择固定数量的消息。StreamingPull has a 100% error rate表示连接断开。依靠持久性的双向链接。优化高吞吐。buffer带来的问题,比如duplicate和stuck。 Cloud Pub/Sub dynamically tunes the number of publishing forwarders that receive messages for a particular topic as the throughput changes。

message ordering

服务本身不保证顺序

  • 顺序无所谓 完美契合
  • 处理顺序无所谓 处理结果存放有顺序要求 publisher加timestamp
  • 处理顺序有要求 1加uid可以使用datastore代替pub/sub 2使用Cloud Monitoring检查oldest unacked message age

Authentication

认证方式

  • Service accounts推荐
  • User accounts end user authentication authentication确认用户身份 authorization确定用户权限
OAuth 2.0

获取access token

管理credential最佳实践

  • 不要嵌入源码,使用Cloud Key Management Service
  • 测试和生产环境使用不同的credential
  • 使用https传送credential
  • 不要把存在时间较长的credential存放在客户端app上
  • revoke token

Access Control

  • 以topic基础授予权限
  • 授予有限的权限
  • 授予一群开发者

Encryption Key Management

Indentity and access management

Google group

没有密码。无法登陆。

Cloud Identity domain

没有G Suite domain权限

Resource

Compute Engine instance和Cloud Storage bucket权限需要project权限。pub/sub不需要。service.resource.verb

Roles

permissions集合。Primitive roles,Predefined roles,Custom roles

IAM Policy

grant roles to users。

Policy hierarchy

每一层resource都可以设置一个IAM Policy。Child policies cannot restrict access granted at a higher level.

predefined roles

细粒度,由Google维护

Service account

属于应用程序,使用该账号调用Google API。不仅是identity还是resource。

Data Studio

绿色dimension 蓝色metrics filter控件group部分控制 source Google产品

cache

首先看query cache.如果调用同一个query。然后看prefetch cache。The prefetch cache is only active for data sources that use owner’s credentials to access the underlying data. Both the query cache and prefetch cache automatically expire periodically. You can’t disable the query cache, as doing so could result in higher data usage costs for paid data sources, such as BigQuery.

Stackdriver Logging is a centralized log-management service that collects log data from applications running on GCP and other public and private cloud platforms. Commonly, streaming data is used for telemetry. Most streaming data is generated by users or systems distributed across the globe. Cloud Pub/Sub has global endpoints and leverages Google’s global front-end load balancer to support data ingestion across all GCP regions, with minimal latency.

Datastore

property value可以是array,一个key的property可以有多个value。从而再index table里,一个entity有多个entry.所以会有组合爆炸问题。自增长ID是数字,custom name是字符串。