Apr 16, 2020
BigQuery is Google’s serverless, highly scalable, enterprise data warehouse. This large-scale data store was designed to make a data analyst’s job more productive with unmatched price-performance.
Simply put, BigQuery gives analysts a place to store and manage petabytes of data without the hassle of managing the infrastructure.
In today’s age, managing an onsite data warehouse is challenging, expensive, and time-consuming.
Google saw the impending issue of big data and how it would impact businesses and the way they use data for analytics and insights. Thanks to Google, this has cut our time, and our personal stress, in half.
Instead of buying multimillion-dollar servers that would need to be upgraded as demand increased, Google scaled out, buying lots of cheaper servers. Google’s mission then became figuring out a way to make them work together to meet demand.
In other words, Google scaled out rather than up. And when it came time to scale further, the company bought a few more machines instead of having to upgrade a single multimillion-dollar server.
This helped Google create a high-performance, cost-effective data warehouse for pretty much anyone who has data and, more importantly, needs to analyze it.
How It Works
Google BigQuery is a cloud-based, fully managed, serverless enterprise data warehouse that supports analytics over petabyte-scale data. It delivers high-speed analysis of large datasets without requiring investments in onsite infrastructure or database administrators.
The most impressive and unique feature BigQuery has is its ability to scale up and down depending on the needs of each query: it adds and removes computing power and storage resources as required. This is how BigQuery can run queries on terabytes of data in seconds.
Let us take a step back and define exactly what a query is.
A query is simply a question. When one types a search term into Google, one is running a “search query”: the keywords are sent to the search engine and processed by an algorithm that retrieves related results from the search index.
The results of the query appear on a search engine results page.
When it comes to BigQuery, one runs a database query; the database stores data in a structured format.
Running a query here is essentially the same as typing search terms into Google: the query scours the structured data, finds what is needed, and returns it in a form that can feed charts, graphs, or more complex data manipulations.
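To make this concrete, here is a minimal sketch of a database query run against BigQuery with Google’s Python client library. The public dataset it references is real, but the surrounding setup (an authenticated Google Cloud project and credentials) is assumed:

```python
# Minimal sketch: run a BigQuery query with the official Python client.
# Assumes Google Cloud credentials and a project are already configured.
from google.cloud import bigquery

client = bigquery.Client()

# A structured question: the ten most common U.S. baby names, summed
# across all years in one of BigQuery's public datasets.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(f"{row.name}: {row.total}")
```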
Google BigQuery is part of the Google Cloud Platform and works in harmony with its other services, though one does not need to adopt the rest of the platform to use it. BigQuery is designed to help expedite the analysis of big data while eliminating onsite storage and overhead costs.
This also eliminates the complexity of maintaining onsite hardware and administrative resources. Below is a list of some of the advantages BigQuery has to offer, according to Enterprise Strategy Group.
Time To Value
Users can get their data warehouse environment online quickly and easily, without expert-level system and database administration skills, because the infrastructure is eliminated and management is reduced (an approach known as “No-Ops” or “Zero-Ops”).
Simplicity
Complete all major tasks related to data warehouse analytics through an intuitive interface without the hassle of managing the infrastructure.
Scalability
Scale up to petabytes or down to kilobytes depending on your size, performance, and cost requirements.
Speed
Ingest, query, and export PB-sized datasets with impressive speeds using the Google Cloud Platform as the underlying cloud infrastructure.
Reliability
Ensure always-on availability and constant uptime running on the Google Cloud Platform with geo-replication across Google data centers.
Security
Protect and control access to encrypted projects and datasets through Google’s cloud-wide identity and access management (IAM).
Cost Optimization
Predict costs with transparent flat rate and/or pay-as-you-go pricing, and contain costs through the use of project and user resource quotas.
BigQuery Is Self-Scaling
Google BigQuery is self-scaling: it identifies the resources each query needs to finish quickly and efficiently and provides them to meet the demand. Once the demand has been met and the workload completed, it reallocates those resources to other projects and other users who need them at that moment.
This is how BigQuery successfully runs queries on massive amounts of data. Both in transferring data in and in processing that data for results, BigQuery is capable of delivering immense speeds even at petabyte scales.
For enhanced data durability, BigQuery provides high availability and reliability through geographic replication that is completely transparent to its users, without requiring them to obtain the physical resources and space to house it all.
Google BigQuery enables organizations to combat the cost and complexity challenges that come with building and maintaining a fast, scalable, and reliable big data infrastructure.
By utilizing Google BigQuery’s cloud-based approach, the time and cost traditionally dedicated to protecting data and provisioning compute capacity are drastically reduced.
Best Practices For Controlling Costs
Google’s BigQuery gives analysts not one but all of the tools necessary to keep costs under control. Thanks to BigQuery’s serverless architecture, one pays only for the storage and compute resources actually used.
BigQuery’s separation of storage and compute makes it easy to scale each independently and on demand, resulting in low-cost, economical storage.
BigQuery can lower the total cost of ownership by 56%–88%. One can analyze up to 1 TB of data and store 10 GB of data for free each month.
But if someone is not paying close attention to the way they run their queries, they can incur unexpected costs.
This is because BigQuery charges based on the amount of data a query reads. If a query only needs to pull from, let us say, 10 columns, but is written to scan every column in the table, one incurs the cost of running a full query against all of the data in that table.
If one has terabytes of data this query could cost them a pretty penny.
This is when setting up tables around frequently run queries will save money in the long run. This is where we put the time and effort into creating tables with fewer columns for the queries we run frequently for reporting.
By doing so, every time we report we are only scanning the exact columns (data) we need to complete the job.
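As a rough sketch of what that looks like in practice, the following uses BigQuery’s CREATE TABLE ... AS SELECT statement through the Python client to materialize a narrow reporting table; the dataset, table, and column names here are all hypothetical:

```python
# Hypothetical sketch: build a narrow reporting table so frequent
# reports scan only the columns and date range they actually need.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
    CREATE OR REPLACE TABLE my_dataset.daily_report AS
    SELECT event_date, client_id, campaign_id, clicks, impressions
    FROM my_dataset.wide_events
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""

client.query(ddl).result()  # wait for the table to be (re)created
```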
What Google Recommends
The first recommendation: do not pull in excess columns that drive up costs. Building on the idea of querying only the columns of data one needs, Google recommends not using the SELECT * option when running queries.
Using SELECT * is the most expensive way to query data. When one uses SELECT *, BigQuery does a full scan of every column in the table.
If one still wants the convenience of SELECT *, Google recommends using SELECT * EXCEPT to exclude one or more columns from the results.
This allows the analyst to select only the columns they need to query, helping control costs.
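As a quick sketch (the table and column names are made up for illustration), SELECT * EXCEPT looks like this in BigQuery Standard SQL, run here through the Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()

# SELECT * EXCEPT keeps the convenience of SELECT * while skipping
# columns that would otherwise be scanned and billed unnecessarily.
query = """
    SELECT * EXCEPT (raw_payload, debug_notes)
    FROM my_dataset.daily_report
"""

rows = client.query(query).result()
```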
Personally, we would not recommend using the SELECT * option at all, in case one mistakenly misses a column they meant to exclude from the query.
For this reason, as previously mentioned, we have taken the time to create destination tables, which we can filter by date range so that we only query the data we absolutely need.
In addition to the ability to create destination tables to control costs, BigQuery has an awesome feature that allows analysts to price their queries before running them.
According to Google, queries are billed according to the number of bytes read. To estimate costs before running a query:
View the query validator in the Cloud Console or the classic web UI
Use the Google Cloud Platform Pricing Calculator
Perform a dry run (see the sketch below) by using the:
--dry_run flag in the bq command-line tool
dryRun parameter when submitting a query job using the API
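Here is a minimal dry-run sketch using the Python client, which exposes the API’s dryRun behavior through QueryJobConfig; the table name is hypothetical. BigQuery validates the query and reports the bytes it would scan, without running it or billing anything:

```python
from google.cloud import bigquery

client = bigquery.Client()
# dry_run validates the query and estimates bytes without executing it;
# disabling the cache keeps the estimate honest.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT client_id, clicks FROM my_dataset.daily_report",  # hypothetical table
    job_config=job_config,
)
print(f"This query would process {job.total_bytes_processed} bytes.")
```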
BigQuery can also limit query costs with the maximum bytes billed setting, which caps the number of bytes a query is allowed to bill.
When maximum bytes billed is set and the query would read bytes beyond the limit, the query fails without incurring a charge.
If a query fails because of the maximum bytes billed setting, an error like the following is returned:
Error: Query exceeded limit for bytes billed: 1000000. 10485760 or higher required.
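A sketch of that cap through the Python client’s maximum_bytes_billed setting (again with a hypothetical table name); a query that would bill past the limit fails before any charge is incurred:

```python
from google.api_core.exceptions import BadRequest
from google.cloud import bigquery

client = bigquery.Client()
# Cap this query at roughly 1 MB billed; exceeding the cap fails the
# query without charging for it.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=1_000_000)

try:
    client.query(
        "SELECT * FROM my_dataset.daily_report",  # hypothetical table
        job_config=job_config,
    ).result()
except BadRequest as err:
    print(err)  # e.g., "Query exceeded limit for bytes billed: 1000000..."
```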
How BigQuery Has Helped Conduit
Here at Conduit, we were faced with an issue similar to the one Google faced when it set out on its journey to create BigQuery.
We hit a point where we needed our reports to be smoother and faster, and we had to find a solution that could handle all the data from our 1,500 monthly clients.
This was not an easy feat; we needed a solution that was simple, reliable, affordable, and not limited in its scaling and speed capabilities.
After months of researching, we narrowed our search down to BigQuery.
For us, BigQuery had a time-to-value ratio we could not match with other applications, let alone with our previous process.
At one point in time, we were pulling our client data into reports via Google Sheets and Excel. This did not just pose a scaling issue once we hit a certain amount of data; it was as slow as molasses.
BigQuery solved that issue for us, with its pretty much limitless scaling capabilities as well as its ability to process petabytes of data in seconds!
Simplicity
Simplicity was another factor we valued heavily. We needed something we could get a grasp on and manage in house, helping us control costs and manipulate our data the way we wanted.
BigQuery provides us with the infrastructure not only to store our data but to transform it with queries, so we can connect our data sources in as lightweight a way as possible, helping us control costs in the long run.
Scalability
Touching back on our need for scalability: with 1,500 clients monthly, most running multiple campaigns with us, we needed an application that could scale with our tremendous growth.
BigQuery was designed with that in mind; its virtually unlimited scalability, combined with its low costs, makes it a great fit for analytics data.
Speed
We were not just looking for a solution that was simple and had massive potential for scalability; it also needed to be blazing fast to give us real-time reporting.
BigQuery’s serverless infrastructure provides us with the speeds we need to accomplish all of our reporting needs.
The infrastructure behind BigQuery would be unreachable for most companies. Access to that type of computing power is not common and would cost an unfathomable amount of money to build (a bit of an exaggeration, but you get the point).
With BigQuery’s capability to run a 4.06 TB query in roughly 24 seconds, our search for a solution with the speed we required ended there.
Reliability
Reliability also played a massive role in our search. The need for a reliable infrastructure to eliminate the need for our own was top of mind. BigQuery provided us with exactly that.
According to Google, this is exactly how they accomplish their unmatched reliability, “BigQuery has a geographically diverse team of Site Reliability Engineers (SREs) who monitor the service 24/7 for outages, performance degradation, latency, and failures.
SREs track the service against internal SLOs, which are often much stricter than public SLAs. We are also able to help customers research not-so-obvious SQL issues.
The BigQuery team works behind the scenes to help ensure that you get the most current software stack running on fantastic infrastructure.
To that end, we may seamlessly migrate your queries to a different data center (while of course respecting the dataset location constraints you’ve set, e.g., if you’ve asked that it remains in Europe).
This means that your BigQuery queries may run in one data center in your region in the morning, and in another data center in the afternoon, as we roll out a new version of Dremel, upgrade networking or hardware or implement a new compression algorithm.”
Protection
Simplicity, speed, and reliability were not all we were looking for; we needed security above all.
With over 1,500 clients and all the data that comes with that volume, we needed to make sure our clients were protected.
BigQuery makes this possible by encrypting data and controlling access to projects and datasets through Google’s cloud-wide identity and access management (IAM).
Conclusion
When it all boils down, cost is the overall deciding factor, and BigQuery saves not only money but also time.
BigQuery’s transparent pricing allows us to easily calculate and forecast costs based on the number of reports we create and the queries we run.
On top of that, BigQuery lets us analyze 1 TB of data and store 10 GB for free each month.
Making the switch was a no-brainer. With all of this, we hope to have shed some light on your current, stressful process.
If you are looking for a White-Label Digital Solution to help you lighten the load and make your current process more streamlined and stress-free, contact Conduit Digital.
We have the tools, knowledge, and expertise to bring your clients’ digital campaigns to the next level. See how a partnership with Conduit can help grow your agency to the moon.