Using Google Cloud Platform to Analyze Cloudflare Logs

Introduction
This tutorial covers how to set up a process for obtaining Cloudflare logs via the Cloudflare API and analyzing them with GCP (Google Cloud Platform) components: Google Cloud Storage for storing the logs, Google BigQuery for importing and querying the data, and Google Data Studio for running visual reports.

$500 GCP credit
Google Cloud is offering a $500 credit towards a new Google Cloud account to help you get started. In order to receive a credit, please follow these instructions.

Cloudflare Enterprise customers have two options for setting up this process:
1. Manual setup for obtaining Cloudflare logs on demand
2. Automated setup for obtaining Cloudflare logs on a regular basis, using a cron job

Data Flow Diagram

Cloudflare_logs_analysis_using_google_Cloud_platform.png

Obtaining Data Manually

This setup is suited to occasional, on-demand checks of the logs.

Requirements and Prerequisites

Install Go
Make sure you have Go 1.7 or later installed on your workstation or VM. We suggest using the latest version of Go, which is 1.9 at the time of writing. Download here

Working with Google Cloud

  • Select or create a Google Cloud Platform project.
  • To work with your Google Cloud project, install the Google Cloud SDK, which you can download here:
  • Make sure you have configured and enabled a Google Billing profile by following the instructions here:
  • Make sure you have enabled the Google APIs for the following components here:

         - Google Cloud Storage
         - Google BigQuery
         - Cloud Function:

After a successful Google Cloud SDK installation, run the following command to initialize it:

./google-cloud-sdk/bin/gcloud init
  • Configure Google Application Default Credentials
    The logshare-cli script will pull logs from Cloudflare and push them into your GCS bucket. This requires authentication between your CLI (where the script is run from) and GCS. Follow the instructions here to authenticate.

Configuring Google Application Default Credentials will open your default browser so that you can authenticate the Google Cloud SDK with the Google account you are logged into GCP with. Run the following command to get started:

gcloud auth application-default login

Building the Environment
The process is divided into two phases:

  • In the first phase, we will set up a Cloud Function whose task is to import data from the existing GCP Storage bucket into a BigQuery table. This Cloud Function is triggered whenever a new Cloudflare log file is uploaded to the Google Cloud Storage bucket.
  • In the second phase, we will use the logshare script, which downloads the log files in JSON format via the Cloudflare API and uploads them to the predefined Google Cloud Storage bucket.

Phase 1: Creating the Cloud Function
Clone GCS-To-Big-Query to your local workstation:

git clone https://github.com/cloudflare/GCS-To-Big-Query


config.json specifies the BigQuery dataset name (which can be anything) and table name (auto-created in BigQuery) that will be used to import the data from the log files stored in the GCP Storage bucket.

Example of a proper config.json file:

configjson_example.png
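For illustration only, config.json along these lines would define the dataset and table. Note that the key names and values below are assumptions, so keep the keys that already exist in the repository's config.json and only change their values:

cat > config.json <<'EOF'
{
  "dataset": "cloudflare_logs",
  "table": "cf_analytics"
}
EOF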

In your cloned repository, there are 5 files that need to be compressed into an archive in order to create the Cloud Function. Using your file explorer (or the command line, as shown below), select the following files and ZIP them together (the name of the archive does not matter):

  • LICENSE
  • config.json
  • index.js
  • package.json
  • README.md
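From the command line, for example, you can run the following from inside the cloned repository (the archive name gcs-to-bigquery.zip is just an example):

zip gcs-to-bigquery.zip LICENSE config.json index.js package.json README.md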

Navigate to the GCP Cloud Function UI:

Select ‘Create Function’ and the following modal will appear:

GoogleCloudFunction.png

Name - the name of the function (can be anything)
Trigger - select the option “Cloud Storage bucket”
Bucket - select the GCP Storage bucket that will be used to upload Cloudflare log files (the name needs to be unique across GCS)
Source code - click the option “ZIP upload” and select the archive you created in the step described above
Stage bucket - the GCP Storage bucket that will be used to store and run the Cloud Function files

or

run the following SDK command from your workstation:

gcloud beta functions deploy <name of the cloud function> --trigger-bucket=<trigger-bucket-name> \
--source=<path to gcsToBigQuery repository on your workstation> --stage-bucket=<gs://gcs-bucket> \
--entry-point=jsonLoad

where

trigger-bucket - the GCP Storage bucket that will be used to upload Cloudflare log files.
stage-bucket - the GCP Storage bucket that will be used to store and run the Cloud Function files.
entry-point - the hardcoded value is jsonLoad

Please note that the trigger bucket (storage bucket) should not be the same as the stage bucket:

gcloud beta functions deploy jsonLoad --trigger-bucket=cloudflare_logs_camiliame \
--source=. --stage-bucket=gs://cf-script-cloud-function --entry-point=jsonLoad
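If these buckets do not exist yet, you can create them beforehand with gsutil (the bucket names below are just the examples used in this article; your names must be globally unique):

gsutil mb gs://cloudflare_logs_camiliame
gsutil mb gs://cf-script-cloud-function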

Once you have successfully created the Cloud Function, you can move on to the second phase.

Phase 2: Importing Cloudflare logs into BigQuery
Go to GitHub and install the Cloudflare logshare script by running the following command:

go get github.com/cloudflare/logshare/...
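go get places the logshare-cli binary in your Go workspace's bin directory. If that directory is not already on your PATH, you can add it; the example below assumes the default GOPATH of $HOME/go (Go 1.8+), so adjust it if your GOPATH is set elsewhere:

export PATH=$PATH:$HOME/go/bin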

In order to run logshare-cli you will need to have the following information ready:

  • Cloudflare user account API Key (api-key)
  • Cloudflare user account email address (api-email)
  • Domain name (zone-name)
  • The timestamp (in Unix seconds) to request logs from (start-time). Defaults to 30 minutes behind the current time (default: 1504137645)
  • The timestamp (in Unix seconds) to request logs to (end-time). Defaults to 20 minutes behind the current time (default: 1504138245)
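If you need to generate these Unix-seconds timestamps yourself, the date command can do it. The examples below use GNU date as found on most Linux distributions (on macOS/BSD the equivalent of the second command is date -v-30M +%s):

date +%s                      # current time in Unix seconds
date -d '30 minutes ago' +%s  # 30 minutes behind the current time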

Instead of end-time, you can also use count, the number of logs to retrieve. Pass '-1' to retrieve all logs for the given time period (default: 1).

For more options, please refer to “GLOBAL OPTIONS” under the “Available Options” section here

logshare-cli --api-key=<api-key> --api-email=<email> --zone-name=<zone-name> \
--start-time=<ts> --count=<count> --by-received --google-storage-bucket=<trigger-bucket> \
--google-project-id=<project-id>

Example,

logshare-cli --api-key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx --api-email=user@domain.com \
--zone-name=domain.com --start-time=1506500000 --count=100 --by-received \
--google-storage-bucket=cloudflare_logs_camiliame --google-project-id=google-project-111111

The --by-received flag retrieves logs by the time they were processed (received) by Cloudflare. This mode allows you to fetch all available logs, rather than selecting logs based on their own timestamps.

After running the command successfully you should receive something like:

[logshare-cli] 18:41:59 Bucket cloudflare_logs_camiliame already exists.
[logshare-cli] 18:42:03 HTTP status 200 | 3865ms | https://api.cloudflare.com/client/v4/zones/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/logs/received?start=1506500000&end=1506561718&count=100
[logshare-cli] 18:42:03 Retrieved 100 logs


Under the defined GCP Storage bucket you will be able to find the newly uploaded log file.

Google_Storage_Bucket.png

And under GCP BigQuery, the predefined table should be created and populated with the log data.

Google_BigQuery.png

Now you can run queries against the table to pull required data for analysis and monitoring purposes.
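For example, here is a quick sketch of a query run with the bq command-line tool, assuming your config.json defined the dataset cloudflare_logs and the table cf_analytics (substitute your own names):

bq query --use_legacy_sql=false \
'SELECT EdgeResponseStatus, COUNT(*) AS requests FROM `cloudflare_logs.cf_analytics` GROUP BY EdgeResponseStatus ORDER BY requests DESC'

This counts requests per HTTP status code returned by the Cloudflare edge; adjust the fields and table name to match your own setup.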

Please note that the log file by default contains only the following fields:

  • EdgeStartTimestamp
  • EdgeResponseStatus
  • EdgeResponseBytes
  • EdgeEndTimestamp
  • ClientRequestURI
  • ClientRequestMethod
  • RayID
  • ClientRequestHost
  • ClientIP

To add more fields, you will need to specify each field individually under --fields in your logshare-cli command, including the default fields as well.


To include all fields in the log file, you can use the following example command:

logshare-cli --api-key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx --api-email=user@domain.com \
--zone-name=domain.com --start-time=1506500000 --count=100 \
--fields CacheCacheStatus,CacheResponseBytes,CacheResponseStatus,ClientASN,\
ClientCountry,ClientDeviceType,ClientIP,ClientIPClass,ClientRequestBytes,ClientRequestHost,\
ClientRequestMethod,ClientRequestProtocol,ClientRequestReferer,ClientRequestURI,\
ClientRequestUserAgent,ClientSSLCipher,ClientSSLProtocol,ClientSrcPort,EdgeColoID,\
EdgeEndTimestamp,EdgePathingStatus,EdgeResponseBytes,EdgeResponseCompressionRatio,\
EdgeResponseStatus,EdgeStartTimestamp,OriginIP,OriginResponseBytes,OriginResponseHTTPExpires,\
OriginResponseHTTPLastModified,OriginResponseStatus,OriginResponseTime,RayID,WAFAction,\
WAFRuleID,ZoneID --by-received --google-storage-bucket=cloudflare_logs_camiliame \
--google-project-id=google-project-111111


The full list of all fields, with descriptions:
"CacheCacheStatus": "unknown | miss | expired | updating | stale | hit | ignored | bypass | revalidated",
"CacheResponseBytes": "Number of bytes returned by the cache",
"CacheResponseStatus": "HTTP status code returned by the cache to the edge: all requests (including non-cacheable ones) go through the cache: also see CacheStatus field",
"ClientASN": "Client AS number",
"ClientCountry": "Country of the client IP address",
"ClientDeviceType": "Client device type",
"ClientIP": "IP address of the client",
"ClientIPClass": "Client IP class",
"ClientRequestBytes": "Number of bytes in the client request",
"ClientRequestHost": "Host requested by the client",
"ClientRequestMethod": "HTTP method of client request",
"ClientRequestProtocol": "HTTP protocol of client request",
"ClientRequestReferer": "HTTP request referrer",
"ClientRequestURI": "URI requested by the client",
"ClientRequestUserAgent": "User agent reported by the client",
"ClientSSLCipher": "Client SSL cipher",
"ClientSSLProtocol": "Client SSL protocol",
"ClientSrcPort": "Client source port",
"EdgeColoID": "Cloudflare edge colo id",
"EdgeEndTimestamp": "Unix nanosecond timestamp the edge finished sending response to the client",
"EdgePathingStatus": "Edge pathing status",
"EdgeResponseBytes": "Number of bytes returned by the edge to the client",
"EdgeResponseCompressionRatio": "Edge response compression ratio",
"EdgeResponseStatus": "HTTP status code returned by Cloudflare to the client",
"EdgeStartTimestamp": "Unix nanosecond timestamp the edge received request from the client",
"OriginIP": "IP of the origin server",
"OriginResponseBytes": "Number of bytes returned by the origin server",
"OriginResponseHTTPExpires": "Value of the origin 'expires' header in RFC1123 format",
"OriginResponseHTTPLastModified": "Value of the origin 'last-modified' header in RFC1123 format",
"OriginResponseStatus": "Status returned by the origin server",
"OriginResponseTime": "Number of nanoseconds it took the origin to return the response to edge",
"RayID": "Ray ID of the request",
"WAFAction": "Action taken by the WAF, if triggered",
"WAFRuleID": "ID of the applied WAF rule",
"ZoneID": "Internal zone ID"

 

Obtaining Data Automatically
This setup is good for monitoring requests in real time. 

To automate the process of obtaining Cloudflare logs at a predefined interval (for example 1 minute, which is the default, or 5 minutes, 30 minutes, 1 hour, 1 day, etc.), please follow the process below.

Automated Process for Obtaining Cloudflare Access Logs

The script below uses several Google Cloud modules (Google Cloud Compute, Storage, Cloud Functions, BigQuery). For GitHub instructions, please click here.
It will execute the following:

  • Create a VM micro-instance under Google Compute Engine and install all necessary components (the Go toolchain, curl, Python, etc.).
  • Create a bucket under Google Cloud Storage to store and run the Cloud Function files.
  • Create another bucket under Google Cloud Storage to which the Cloudflare Enterprise access logs are uploaded in JSON format.
  • Create a Cloud Function that imports the Cloudflare access logs from the bucket into BigQuery. The Cloud Function is triggered every time a new log file is uploaded into the bucket.
  • Create a cron job on the VM micro-instance that pulls Cloudflare access logs at a repeated interval (the default is 1 minute) and uploads them to the bucket, conceptually similar to the crontab entry sketched after this list.
  • Create a BigQuery dataset and table to process the imported data.
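A purely illustrative sketch of what such a cron entry (as seen with crontab -l on the VM) could look like; the actual entry is created by the setup script and may differ, and the binary path, placeholders, and log file here are assumptions:

* * * * * /usr/local/bin/logshare-cli --api-key=<api-key> --api-email=<email> --zone-name=<zone-name> --count=-1 --by-received --google-storage-bucket=<upload-bucket> --google-project-id=<project-id> >> /var/log/logshare.log 2>&1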

Please follow the steps below to set up the whole process automatically: 

  1. Select or create a Google Cloud Platform Project
  2. Clone the GCS Automation Script on your local machine:
    git clone https://github.com/cloudflare/GCS-Logshare-Setup-Script
  3. Enable the Service Management API
    Select the Project you are working on and enable the API here
  4. Make sure you have configured and enabled a Google Billing profile by following the instructions here
  5. Make sure you have enabled the Google APIs for the following components here:
     - Google Cloud Storage,
     - Google BigQuery,
     - Cloud Function
  6. Create a copy of config.default.json named config.json:
    cp config.default.json config.json
  7. Modify config.json with your Cloudflare account details (see the example after this list):
     - cloudflare_api_key - Cloudflare API key
     - cloudflare_api_email - Cloudflare user account email address
     - zone_name - domain name, for example mydomain.com
     - gcs_project_id - Google Cloud project ID
  8. Run the main orchestration script:
    bash main.sh
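For step 7, a filled-in config.json might look roughly like the following. All values are placeholders, and the key names follow the list above, so double-check them against config.default.json before running the script:

cat > config.json <<'EOF'
{
  "cloudflare_api_key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "cloudflare_api_email": "user@domain.com",
  "zone_name": "mydomain.com",
  "gcs_project_id": "google-project-111111"
}
EOF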

Please allow 5-10 minutes for the VM and other components to be set up and configured.

After running the command successfully you should receive something like:
==
Python 2.7.11
0
Updates are available for some Cloud SDK components. To install them,
please run:
$ gcloud components update
GCloud SDK already Installing. Skipping init configuration.
0
Updated property [core/project].

Updates are available for some Cloud SDK components. To install them,
please run:
$ gcloud components update

Creating gs://cf-els-vm-setupfiles-17415/...
Copying file://config.json [Content-Type=application/json]...
Copying file://gcs-initialize.sh [Content-Type=application/x-sh]...
- [2 files][ 4.5 KiB/ 4.5 KiB]
Operation completed over 2 objects/4.5 KiB.
Creating VM...

Created [https://www.googleapis.com/compute/v1/projects/deep-presence-111111/zones/us-central1-a/instances/logshare-cli-cron-17415].
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
logshare-cli-cron-17415 us-central1-a f1-micro 10.128.0.2 35.202.173.186 RUNNING

Successfully kicked off the VM provisioning steps. The VM takes between 4-6 minutes to fully provision.

If you are seeing any issues, please share them by submitting an issue to the repository. You can view the VM's startup script progress by tailing the syslog file:
tail -f /var/log/syslog

Enjoy!
==

Script Monitoring
For monitoring the progress of the script, please SSH to your newly created VM micro-instance and use the following command:

tail -f /var/log/syslog

Please note that, for simplicity, the names of the VM, Storage bucket, Cloud Function, BigQuery dataset, and table all contain the same number. When troubleshooting, this helps identify that all of these components belong to the same group.

Example,
Compute Engine VM micro-instance: logshare-cli-cron-17415
Storage Bucket: cf-els-vm-setupfiles-17415
Cloud Function: cflogs_upload_bucket_17415
BigQuery dataset: cloudflare_logs_17415
BigQuery table: cloudflare_els_17415

 

Analyzing data in Data Studio

To analyze and visualize logs, you can use Data Studio or any other third-party service. Data Studio allows you, in a few simple steps, to generate graphs and charts from a BigQuery table used as the input data source. These reports have an option to refresh the data and get real-time analytics.

Below is an example of reports built in Data Studio in Edit mode and final View mode.

Reports in Edit Mode

GoogleDataStudio_Edit_Mode.png

Reports in View Mode

GoogleDataStudio_Preview_Mode.png

 

 
