What is the Microsoft PROSE Code Accelerator SDK for Python?
The Microsoft PROSE Code Accelerator SDK uses the power of PROSE to quickly and accurately generate Python code for common data preparation tasks. Code Accelerator includes functionality for data ingestion, data type correction, and pattern identification in string data.
Using Code Accelerator
The first thing to know about Code Accelerator is that it is a tool masquerading as an SDK. You call Code Accelerator from a Python interactive environment and it produces code for you. You can examine the results, adjust your call to Code Accelerator, and repeat as needed. But when you are done, you own the resulting code, which has no dependencies on Code Accelerator. You can modify or extend the code, copy it into another system, or work with it however you need.
Code Accelerator contains a series of classes which all follow the same pattern:
Interactions with Code Accelerator start with a builder, which is the object that will build code for you. There are builders for reading delimited files (`ReadCsvBuilder`), reading fixed-width files (`ReadFwfBuilder`), reading JSON (`ReadJsonBuilder`), detecting data types (`DetectTypesBuilder`), and finding patterns in strings (`FindPatternsBuilder`). In each case the interaction is the same:
```
            init               learn                code
  [task] --------> Builder ---------> LearnResult --------> printable lambda
                     ^                  |      \
                     |                  | data  \
                     \--- clues         |        \-----> metadata
                                        v
                                 output for task
```
1. Create the builder, supplying any required parameters.
2. Optionally provide clues to help the builder.
3. Call `learn()` to have PROSE generate code for the task.
4. Review the resulting data and code.
5. Return to step 2 and repeat as necessary until the results are correct.
6. Take the code and use it independently of Code Accelerator. Code Accelerator itself is just a tool; it is not used at runtime.
NOTE: Code Accelerator is NOT intended for use at runtime. Use it when creating your code. Only use the code it generates at runtime.
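The loop above can be sketched in a few lines. The module path (`prose.codeaccelerator`) and the exact builder signature are assumptions based on the SDK's published examples, so adjust them to your installation:

```python
# A minimal sketch of the Code Accelerator loop. The module path and exact
# signatures are assumptions; consult the SDK reference for your version.
try:
    import prose.codeaccelerator as cx
    sdk_available = True
except ImportError:
    sdk_available = False  # a dev-time tool, so it may not be installed here

def generate_reader(sample_path):
    # 1. Create the builder with a sample file.
    builder = cx.ReadCsvBuilder(sample_path)
    # 2. (Optionally set clues/properties on the builder here.)
    # 3. Learn: PROSE synthesizes the reading code.
    result = builder.learn()
    # 4. Review a preview of the output, then take the generated source.
    print(result.data(5))
    return result.code()  # dependency-free Python that you now own

if sdk_available:
    print(generate_reader("sales.csv"))
```

Once `code()` has been saved off, the generated function runs anywhere Python runs; only this generation step needs the SDK.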
About the code method on learn result
When `learn()` is called on any Code Accelerator builder, PROSE synthesizes Python code for the specified task. This code may be retrieved from the `code()` method on the result object. The code takes the form of one or more functions which may later be called, with the same or similar data to what was initially provided to the builder, to accomplish the task. `ReadCsvBuilder`, for instance, will produce a `read_file` function that takes a path to a file of the same format as the file passed to the builder, while `DetectTypesBuilder` will produce a `coerce_types` function that takes a data frame of the same form as what was originally passed to the builder.

Calling the generated function with data having a different schema may result in errors. For example, calling the `coerce_types` function with a data frame that has a different schema than what was used to generate the function will likely fail or produce unwanted results.
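To make the schema sensitivity concrete, here is a hand-written stand-in for the kind of function `DetectTypesBuilder` emits. The column names and conversions are hypothetical, and a plain dict of lists is used in place of a pandas DataFrame to keep the sketch self-contained:

```python
from datetime import datetime

# Hypothetical stand-in for a generated coerce_types function. The real
# generated code targets pandas; a dict of lists keeps this self-contained.
def coerce_types(df):
    # The column names and conversions learned from the sample are
    # baked directly into the generated code:
    return {
        "id": [int(v) for v in df["id"]],
        "when": [datetime.strptime(v, "%Y-%m-%d") for v in df["when"]],
    }

sample = {"id": ["1", "2"], "when": ["2020-01-01", "2020-02-15"]}
coerced = coerce_types(sample)           # works: same schema as the sample

try:
    coerce_types({"identifier": ["1"]})  # different schema
except KeyError as err:
    mismatch = err                       # fails: learned column "id" is absent
```

This is exactly why a schema change means going back to the builder and learning again rather than editing data to fit the old function.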
About the data method on learn result
data() method is intended to give a quick look at what the generated code will do. For the input originally given
to the builder, it will show what the output of running the generated code would be. This way you can see if the code
does what you want, and if not, provide different or additional information to the builder and try learning again.
Since the provided data may be much larger than what you need for a quick check, the method takes an optional parameter of the number of output values to return. For some learn results this is a number of a rows, for others it is just a number of scalar values.
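The effect of that parameter can be pictured with plain lists; the real `data()` lives on the SDK's result object, and `take` below is purely an illustration:

```python
# Illustration only: data(n) caps how many output values come back,
# whether they are rows of a table or individual scalar values.
def take(values, n=None):
    values = list(values)
    return values if n is None else values[:n]

rows = [("2020-01-01", 1), ("2020-02-15", 2), ("2020-03-07", 3)]
preview = take(rows, 2)   # first 2 rows, enough for a quick check
```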
Working with PySpark
Each builder supports the `Target` property, which specifies the runtime environment for the generated code. By default the generated code will use pandas, but if you set the `Target` property to `"pyspark"`, it will produce code for that runtime instead.
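Switching targets is a one-line change on the builder. As a sketch, with the module path and the lower-case `target` attribute spelling being assumptions (the `"pyspark"` value comes from the text above):

```python
# Sketch: asking a builder for PySpark code instead of the pandas default.
# Module path and attribute spelling are assumptions; check the SDK docs.
try:
    import prose.codeaccelerator as cx
    builder = cx.ReadCsvBuilder("sales.csv")
    builder.target = "pyspark"   # the default target generates pandas code
    pyspark_code = builder.learn().code()
except ImportError:
    pyspark_code = None          # SDK not installed in this environment
```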
Some things to keep in mind about PySpark:
- In order to target PySpark, you must first `pip install pyspark`. Code Accelerator will use it dynamically if it is installed and you request the PySpark target, but it will fail if you target PySpark and it isn't already installed.
- The generated code for PySpark does not `collect`. Code Accelerator generates code that can fit into the beginning or middle of a Spark pipeline, so you can mix Code Accelerator generated code with other code. Once you are done processing your data, you will need to call `collect` or whatever other operator produces final results yourself.
- The `data()` method, however, ensures that you get some data back to examine so you can determine whether the generated code is correct. This may run a Spark job and collect the results, or it may produce the data another way; in any case the data is guaranteed to be the same data that the generated code will produce.